A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
from pandas.api.types import is_string_dtype, is_numeric_dtype
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
pd.set_option('display.float_format', lambda x: '%.2f' % x)
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Loading the dataset from a CSV file.
# I am also going to CREATE A NEW FEATURE by combining the three date columns into one.
data = pd.read_csv("INNHotelsGroup.csv", parse_dates= {"date" : ["arrival_date", "arrival_month","arrival_year"]})
data
| | date | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 10 2017 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | 6 11 2018 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | 28 2 2018 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | 20 5 2018 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | 11 4 2018 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36270 | 3 8 2018 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
| 36271 | 17 10 2018 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
| 36272 | 1 7 2018 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
| 36273 | 21 4 2018 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| 36274 | 30 12 2018 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | Offline | 0 | 0 | 0 | 161.67 | 0 | Not_Canceled |
36275 rows × 17 columns
# I am changing the new column to 'date time'
data['date']=pd.to_datetime(data['date'], errors='coerce')
# I am also going to CREATE A NEW FEATURE called 'month'
data['month'] = data['date'].dt.month
# I am also going to CREATE A NEW FEATURE called 'week_of_year' by extracting the week number (1-52)
data['week_of_year'] = data.date.apply(lambda x: x.weekofyear)
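When three numeric columns are joined into one string, the day and month order becomes ambiguous: "2 10 2017" can be read as 10 February or 2 October. In the output above, the first row (day 2, month 10, year 2017) ends up as 2017-02-10, which suggests day and month may have been swapped for some rows. The sketch below is one possible way to avoid that ambiguity, assuming the raw CSV columns are named `arrival_date` (day of month), `arrival_month` and `arrival_year` as in the `read_csv` call. The toy frame `raw` and the helper `build_arrival_date` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def build_arrival_date(df):
    """Assemble a datetime from explicit year/month/day columns.

    Impossible dates (e.g. 29 Feb 2018) become NaT instead of raising.
    """
    parts = pd.DataFrame(
        {
            "year": df["arrival_year"],
            "month": df["arrival_month"],
            "day": df["arrival_date"],
        }
    )
    return pd.to_datetime(parts, errors="coerce")


# Toy rows (hypothetical): an ambiguous date, a valid date, an impossible date
raw = pd.DataFrame(
    {
        "arrival_date": [2, 28, 29],
        "arrival_month": [10, 2, 2],
        "arrival_year": [2017, 2018, 2018],
    }
)
dates = build_arrival_date(raw)
print(dates)
```

Because each component is named, there is no day-first versus month-first guess involved, and rows with impossible dates still surface as missing values that can be dropped later.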
### Two columns have a lot of zeros but otherwise provide valuable information: Median Fill
# In my review there are 0s in two important columns, and I am going to fill those with the median, as they were not captured at POS
from sklearn.impute import SimpleImputer
rep_0 = SimpleImputer(missing_values=0, strategy='median')
cols = ["lead_time", "avg_price_per_room"]
imputer = rep_0.fit(data[cols])
data[cols] = imputer.transform(data[cols])
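Before replacing zeros it can help to see how many there are, since whether a zero is a missing value or a genuine observation is a judgment call. The sketch below counts zeros per column and then applies the same idea as the imputer above (median of the non-zero values) using plain pandas. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with zeros in both columns
toy = pd.DataFrame(
    {
        "lead_time": [0, 5, 10, 0],
        "avg_price_per_room": [65.0, 0.0, 100.0, 80.0],
    }
)

# How many zeros, and what share of rows, per column
zero_counts = (toy == 0).sum()
zero_share = (toy == 0).mean() * 100
print(pd.DataFrame({"zeros": zero_counts, "percent": zero_share}))

# Treat zeros as missing, then fill with the median of the remaining values
masked = toy.mask(toy == 0)
filled = masked.fillna(masked.median())
print(filled)
```

If the share of zeros in a column is large, the median fill will noticeably reshape that column's distribution, which is worth keeping in mind when reading the later plots.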
# Do we have any missing data as a %
missing_count = data.isnull().sum() # the count of missing values
value_count = data.isnull().count() # the count of all values
missing_percentage = round(
missing_count / value_count * 100, 2
) # the percentage of missing values
missing_data = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
# create a dataframe
print(missing_data)
                                      count  percentage
date                                     37        0.10
Booking_ID                                0        0.00
no_of_adults                              0        0.00
no_of_children                            0        0.00
no_of_weekend_nights                      0        0.00
no_of_week_nights                         0        0.00
type_of_meal_plan                         0        0.00
required_car_parking_space                0        0.00
room_type_reserved                        0        0.00
lead_time                                 0        0.00
market_segment_type                       0        0.00
repeated_guest                            0        0.00
no_of_previous_cancellations              0        0.00
no_of_previous_bookings_not_canceled      0        0.00
avg_price_per_room                        0        0.00
no_of_special_requests                    0        0.00
booking_status                            0        0.00
month                                    37        0.10
week_of_year                             37        0.10
# There is missing data which follows a pattern. I am going to drop those rows that don't have dates and keep the rest.
data = data.dropna(subset=['date', 'month', 'week_of_year'])
# Do we have any missing data as a %? It's fixed!
missing_count = data.isnull().sum() # the count of missing values
value_count = data.isnull().count() # the count of all values
missing_percentage = round(
missing_count / value_count * 100, 2
) # the percentage of missing values
missing_data = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
# create a dataframe
print(missing_data)
                                      count  percentage
date                                      0        0.00
Booking_ID                                0        0.00
no_of_adults                              0        0.00
no_of_children                            0        0.00
no_of_weekend_nights                      0        0.00
no_of_week_nights                         0        0.00
type_of_meal_plan                         0        0.00
required_car_parking_space                0        0.00
room_type_reserved                        0        0.00
lead_time                                 0        0.00
market_segment_type                       0        0.00
repeated_guest                            0        0.00
no_of_previous_cancellations              0        0.00
no_of_previous_bookings_not_canceled      0        0.00
avg_price_per_room                        0        0.00
no_of_special_requests                    0        0.00
booking_status                            0        0.00
month                                     0        0.00
week_of_year                              0        0.00
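The missing-data check is run several times in this notebook, so it could be wrapped in a small helper. This is only a sketch: the function name `missing_summary` and the frame `toy` are hypothetical, and the calculation mirrors the count and percentage logic used in the cells above.

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    count = df.isnull().sum()
    percentage = (df.isnull().mean() * 100).round(2)
    return pd.DataFrame({"count": count, "percentage": percentage})


# Hypothetical toy data: one missing value in column "a"
toy = pd.DataFrame({"a": [1, None, 3, 4], "b": ["x", "y", "z", "w"]})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` after each cleaning step would then give the same kind of table without repeating the cell.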
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| required_car_parking_space | 36238.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| repeated_guest | 36238.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| no_of_special_requests | 36238.00 | 0.62 | 0.79 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| month | 36238.00 | 6.96 | 3.26 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36238 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   date                                  36238 non-null  datetime64[ns]
 1   Booking_ID                            36238 non-null  object
 2   no_of_adults                          36238 non-null  int64
 3   no_of_children                        36238 non-null  int64
 4   no_of_weekend_nights                  36238 non-null  int64
 5   no_of_week_nights                     36238 non-null  int64
 6   type_of_meal_plan                     36238 non-null  object
 7   required_car_parking_space            36238 non-null  int64
 8   room_type_reserved                    36238 non-null  object
 9   lead_time                             36238 non-null  float64
 10  market_segment_type                   36238 non-null  object
 11  repeated_guest                        36238 non-null  int64
 12  no_of_previous_cancellations          36238 non-null  int64
 13  no_of_previous_bookings_not_canceled  36238 non-null  int64
 14  avg_price_per_room                    36238 non-null  float64
 15  no_of_special_requests                36238 non-null  int64
 16  booking_status                        36238 non-null  object
 17  month                                 36238 non-null  float64
 18  week_of_year                          36238 non-null  float64
dtypes: datetime64[ns](1), float64(4), int64(9), object(5)
memory usage: 5.5+ MB
data.head(5)
| | date | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | month | week_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-10 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224.00 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled | 2.00 | 6.00 |
| 1 | 2018-06-11 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5.00 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled | 6.00 | 24.00 |
| 2 | 2018-02-28 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1.00 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled | 2.00 | 9.00 |
| 3 | 2018-05-20 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211.00 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled | 5.00 | 20.00 |
| 4 | 2018-11-04 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48.00 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled | 11.00 | 44.00 |
# I am dropping Booking_ID as it's a system-generated number
# I am dropping the combined date field
data = data.drop(columns=['Booking_ID'])
data = data.drop(columns=['date'])
data.head(25)
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | month | week_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224.00 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled | 2.00 | 6.00 |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5.00 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled | 6.00 | 24.00 |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1.00 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled | 2.00 | 9.00 |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211.00 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled | 5.00 | 20.00 |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48.00 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled | 11.00 | 44.00 |
| 5 | 2 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 346.00 | Online | 0 | 0 | 0 | 115.00 | 1 | Canceled | 9.00 | 37.00 |
| 6 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 34.00 | Online | 0 | 0 | 0 | 107.55 | 1 | Not_Canceled | 10.00 | 41.00 |
| 7 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 83.00 | Online | 0 | 0 | 0 | 105.61 | 1 | Not_Canceled | 12.00 | 52.00 |
| 8 | 3 | 0 | 0 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 121.00 | Offline | 0 | 0 | 0 | 96.90 | 1 | Not_Canceled | 6.00 | 23.00 |
| 9 | 2 | 0 | 0 | 5 | Meal Plan 1 | 0 | Room_Type 4 | 44.00 | Online | 0 | 0 | 0 | 133.44 | 3 | Not_Canceled | 10.00 | 42.00 |
| 10 | 1 | 0 | 1 | 0 | Not Selected | 0 | Room_Type 1 | 61.00 | Online | 0 | 0 | 0 | 85.03 | 0 | Not_Canceled | 11.00 | 45.00 |
| 11 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 4 | 35.00 | Online | 0 | 0 | 0 | 140.40 | 1 | Not_Canceled | 4.00 | 18.00 |
| 12 | 2 | 0 | 2 | 1 | Not Selected | 0 | Room_Type 1 | 30.00 | Online | 0 | 0 | 0 | 88.00 | 0 | Canceled | 11.00 | 48.00 |
| 13 | 1 | 0 | 2 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 95.00 | Online | 0 | 0 | 0 | 90.00 | 2 | Canceled | 11.00 | 47.00 |
| 14 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 47.00 | Online | 0 | 0 | 0 | 94.50 | 2 | Not_Canceled | 10.00 | 42.00 |
| 15 | 2 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 256.00 | Online | 0 | 0 | 0 | 115.00 | 1 | Canceled | 6.00 | 24.00 |
| 16 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 61.00 | Offline | 0 | 0 | 0 | 96.00 | 0 | Not_Canceled | 5.00 | 19.00 |
| 17 | 2 | 0 | 1 | 3 | Not Selected | 0 | Room_Type 1 | 1.00 | Online | 0 | 0 | 0 | 96.00 | 1 | Not_Canceled | 10.00 | 40.00 |
| 18 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 99.00 | Online | 0 | 0 | 0 | 65.00 | 0 | Canceled | 10.00 | 44.00 |
| 19 | 2 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 12.00 | Offline | 0 | 0 | 0 | 72.00 | 0 | Not_Canceled | 4.00 | 15.00 |
| 20 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 99.00 | Online | 0 | 0 | 0 | 65.00 | 0 | Canceled | 10.00 | 44.00 |
| 21 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 122.00 | Corporate | 0 | 0 | 0 | 67.00 | 0 | Not_Canceled | 11.00 | 47.00 |
| 22 | 2 | 0 | 2 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 2.00 | Offline | 0 | 0 | 0 | 85.00 | 0 | Not_Canceled | 3.00 | 12.00 |
| 23 | 2 | 0 | 0 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 37.00 | Offline | 0 | 0 | 0 | 105.00 | 0 | Not_Canceled | 10.00 | 41.00 |
| 24 | 2 | 0 | 2 | 1 | Not Selected | 0 | Room_Type 1 | 130.00 | Online | 0 | 0 | 0 | 94.50 | 1 | Not_Canceled | 5.00 | 21.00 |
# Do we have any missing data as a %
missing_count = data.isnull().sum() # the count of missing values
value_count = data.isnull().count() # the count of all values
missing_percentage = round(
missing_count / value_count * 100, 2
) # the percentage of missing values
missing_data = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
# create a dataframe
print(missing_data)
                                      count  percentage
no_of_adults                              0        0.00
no_of_children                            0        0.00
no_of_weekend_nights                      0        0.00
no_of_week_nights                         0        0.00
type_of_meal_plan                         0        0.00
required_car_parking_space                0        0.00
room_type_reserved                        0        0.00
lead_time                                 0        0.00
market_segment_type                       0        0.00
repeated_guest                            0        0.00
no_of_previous_cancellations              0        0.00
no_of_previous_bookings_not_canceled      0        0.00
avg_price_per_room                        0        0.00
no_of_special_requests                    0        0.00
booking_status                            0        0.00
month                                     0        0.00
week_of_year                              0        0.00
# all of the data now looks good
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| required_car_parking_space | 36238.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| repeated_guest | 36238.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| no_of_special_requests | 36238.00 | 0.62 | 0.79 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| month | 36238.00 | 6.96 | 3.26 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
# There are now 36,238 rows and 17 columns in my final data set
data.shape
(36238, 17)
# Let's calculate the ratio of reservations that get canceled vs. those that do not.
# The cancellation rate is HIGH ... nearly one third of the reservations get canceled.
n_true = len(data.loc[data["booking_status"] == "Canceled"])
n_false = len(data.loc[data["booking_status"] == "Not_Canceled"])
print(
"Number of canceled reservations: {0} ({1:2.2f}%)".format(
n_true, (n_true / (n_true + n_false)) * 100
)
)
print(
"Number of reservations not canceled: {0} ({1:2.2f}%)".format(
n_false, (n_false / (n_true + n_false)) * 100
)
)
Number of canceled reservations: 11878 (32.78%)
Number of reservations not canceled: 24360 (67.22%)
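The same class balance can also be read off with `value_counts`, which avoids counting each class by hand. A minimal sketch on a made-up frame (`toy` is hypothetical; on the real data the column would be `data["booking_status"]`):

```python
import pandas as pd

# Hypothetical toy data: 1 canceled booking out of 4
toy = pd.DataFrame(
    {"booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled"]}
)

counts = toy["booking_status"].value_counts()
shares = toy["booking_status"].value_counts(normalize=True).mul(100).round(2)

# Combine absolute counts and percentages in one table
balance = pd.DataFrame({"count": counts, "percent": shares})
print(balance)
```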
# Making a list of all categorical variables: those columns that do not have a numbered order to them
cat_col = [
"type_of_meal_plan",
"room_type_reserved",
"market_segment_type",
"booking_status",
]
# Printing number of count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 40)
Meal Plan 1     27802
Not Selected     5129
Meal Plan 2      3302
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
----------------------------------------
Room_Type 1    28105
Room_Type 4     6049
Room_Type 6      964
Room_Type 2      692
Room_Type 5      263
Room_Type 7      158
Room_Type 3        7
Name: room_type_reserved, dtype: int64
----------------------------------------
Online           23194
Offline          10518
Corporate         2011
Complementary      390
Aviation           125
Name: market_segment_type, dtype: int64
----------------------------------------
Not_Canceled    24360
Canceled        11878
Name: booking_status, dtype: int64
----------------------------------------
# Let's confirm the cancellation rate again ... still about one third get canceled.
n_true = len(data.loc[data["booking_status"] == "Canceled"])
n_false = len(data.loc[data["booking_status"] == "Not_Canceled"])
print(
"Number of canceled reservations: {0} ({1:2.2f}%)".format(
n_true, (n_true / (n_true + n_false)) * 100
)
)
print(
"Number of reservations not canceled: {0} ({1:2.2f}%)".format(
n_false, (n_false / (n_true + n_false)) * 100
)
)
Number of canceled reservations: 11878 (32.78%)
Number of reservations not canceled: 24360 (67.22%)
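# Let's create the correlation matrix to see and plot the relationships between all the numerical values.
# The strongest relationships are month vs. week_of_year (0.99), repeated_guest vs. no_of_previous_bookings_not_canceled (0.54),
# and avg_price_per_room vs. no_of_children (0.36) and no_of_adults (0.28). booking_status is text, so it is not part of this matrix.
plt.figure(figsize=(10,5))
# Restrict to numeric columns so the text columns cannot break corr()
c = data.select_dtypes(include="number").corr()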
sns.heatmap(c,cmap='Blues', annot=True)
c = c.reset_index()
c
| | index | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | month | week_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no_of_adults | 1.00 | -0.02 | 0.10 | 0.11 | 0.01 | 0.09 | -0.19 | -0.05 | -0.12 | 0.28 | 0.19 | 0.02 | 0.02 |
| 1 | no_of_children | -0.02 | 1.00 | 0.03 | 0.02 | 0.03 | -0.05 | -0.04 | -0.02 | -0.02 | 0.36 | 0.12 | 0.01 | 0.01 |
| 2 | no_of_weekend_nights | 0.10 | 0.03 | 1.00 | 0.18 | -0.03 | 0.04 | -0.07 | -0.02 | -0.03 | -0.03 | 0.06 | -0.00 | 0.00 |
| 3 | no_of_week_nights | 0.11 | 0.02 | 0.18 | 1.00 | -0.05 | 0.14 | -0.10 | -0.03 | -0.05 | -0.01 | 0.05 | 0.01 | 0.01 |
| 4 | required_car_parking_space | 0.01 | 0.03 | -0.03 | -0.05 | 1.00 | -0.06 | 0.11 | 0.03 | 0.06 | 0.07 | 0.09 | -0.01 | -0.01 |
| 5 | lead_time | 0.09 | -0.05 | 0.04 | 0.14 | -0.06 | 1.00 | -0.12 | -0.04 | -0.07 | -0.10 | -0.11 | 0.05 | 0.05 |
| 6 | repeated_guest | -0.19 | -0.04 | -0.07 | -0.10 | 0.11 | -0.12 | 1.00 | 0.39 | 0.54 | -0.13 | -0.01 | 0.00 | 0.00 |
| 7 | no_of_previous_cancellations | -0.05 | -0.02 | -0.02 | -0.03 | 0.03 | -0.04 | 0.39 | 1.00 | 0.47 | -0.05 | -0.00 | -0.01 | -0.01 |
| 8 | no_of_previous_bookings_not_canceled | -0.12 | -0.02 | -0.03 | -0.05 | 0.06 | -0.07 | 0.54 | 0.47 | 1.00 | -0.08 | 0.03 | 0.00 | 0.00 |
| 9 | avg_price_per_room | 0.28 | 0.36 | -0.03 | -0.01 | 0.07 | -0.10 | -0.13 | -0.05 | -0.08 | 1.00 | 0.21 | 0.04 | 0.04 |
| 10 | no_of_special_requests | 0.19 | 0.12 | 0.06 | 0.05 | 0.09 | -0.11 | -0.01 | -0.00 | 0.03 | 0.21 | 1.00 | 0.08 | 0.08 |
| 11 | month | 0.02 | 0.01 | -0.00 | 0.01 | -0.01 | 0.05 | 0.00 | -0.01 | 0.00 | 0.04 | 0.08 | 1.00 | 0.99 |
| 12 | week_of_year | 0.02 | 0.01 | 0.00 | 0.01 | -0.01 | 0.05 | 0.00 | -0.01 | 0.00 | 0.04 | 0.08 | 0.99 | 1.00 |
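The matrix above only covers numeric columns, so it says nothing directly about the target. One possible way to bring `booking_status` in is to encode it as 0/1 and look at its correlation with each numeric feature. The sketch below does this on a small made-up frame; `toy` and the column name `is_canceled` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data shaped like the booking table
toy = pd.DataFrame(
    {
        "lead_time": [224, 5, 1, 211, 48, 346],
        "no_of_special_requests": [0, 1, 0, 0, 0, 1],
        "booking_status": [
            "Not_Canceled",
            "Not_Canceled",
            "Canceled",
            "Canceled",
            "Canceled",
            "Canceled",
        ],
    }
)

# Encode the target: 1 for canceled, 0 otherwise
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# Correlation of every numeric feature with the encoded target
target_corr = (
    toy.select_dtypes(include="number")
    .corr()["is_canceled"]
    .drop("is_canceled")
    .sort_values()
)
print(target_corr)
```

Correlation with a binary target only captures linear association, so it is a rough screen rather than a measure of feature importance.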
# This graph shows Online prices have a wider range than the rest; Aviation has a very narrow band, likely a prearranged rate.
plt.figure(figsize=(8,5))
sns.boxenplot(x='market_segment_type', y='avg_price_per_room', data=data, palette='Blues', hue='booking_status')
plt.title("Distribution of Avg Price by Segment ")
Text(0.5, 1.0, 'Distribution of Avg Price by Segment ')
# This graph shows a wide spread in lead time for most of the market segments, except Aviation. One likely explanation is that people book early while shopping for better prices.
plt.figure(figsize=(8,5))
sns.boxenplot(x='market_segment_type', y='lead_time', data=data, palette='Blues', hue='booking_status')
plt.title("Distribution of Lead Time by Segment")
Text(0.5, 1.0, 'Distribution of Lead Time by Segment')
# A similar but different picture of the symmetry (or lack of symmetry) of pricing, e.g. Offline bookings that were not canceled tend to be lower in price than the canceled ones.
plt.figure(figsize=(10,6))
sns.violinplot(x='market_segment_type', y='avg_price_per_room', data=data, hue='booking_status', split=True,palette='Blues')
plt.title("Violin Plot of Avg Price Segment, Separated by Booking Status")
Text(0.5, 1.0, 'Violin Plot of Avg Price Segment, Separated by Booking Status')
# This shows the very uneven distribution of how far in advance people book their reservations by channel, e.g. Offline tends to book far ahead, whereas Aviation has a very short lead time.
plt.figure(figsize=(10,6))
sns.violinplot(x='market_segment_type', y='lead_time', data=data, hue='booking_status', split=True,palette='Blues')
plt.title("Violin Plot of Segment by Lead Time, Separated by Booking Status")
Text(0.5, 1.0, 'Violin Plot of Segment by Lead Time, Separated by Booking Status')
# This graph shows cancellation counts by channel. Online has by far the most cancellations, likely because it is easier to cancel online.
plt.figure(figsize=(8,5))
sns.countplot(x='market_segment_type',data=data, palette='Blues',hue='booking_status')
plt.title("Count of Reservations, Separated by Booking Status")
Text(0.5, 1.0, 'Count of Reservations, Separated by Booking Status')
# For the ease of booking and canceling ... Online users typically pay more per reservation and have the highest cancelation rate.
plt.figure(figsize=(8,5))
sns.barplot(x='market_segment_type',y='avg_price_per_room',data=data, palette='Blues',hue='booking_status')
plt.title("Avg Price by Segment")
Text(0.5, 1.0, 'Avg Price by Segment')
# Same story as above ... just a new cool graph!!!
plt.figure(figsize=(12,8))
sns.stripplot(x='market_segment_type',y='avg_price_per_room', data=data, jitter=True, hue='booking_status', dodge=True, palette='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c4967090>
# This graph shows prices start low in January, gradually climb to a peak in July-September, and then decline each month thereafter.
# create bar plot for average room price by month
plt.title('Average Room Price by Month')
sns.barplot(x='month', y='avg_price_per_room', data=data, palette='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c062dc10>
# Same as above, but another new graph type!!!
# code to plot a simple grouped barplot
plt.figure(figsize=(8, 6))
sns.barplot(
x="month",
y="avg_price_per_room",
hue="booking_status",
data=data,
palette="Blues",
)
plt.ylabel("avg_price_per_room", size=14)
plt.xlabel("arrival_month", size=14)
plt.title("Simple Grouped Barplot", size=18)
Text(0.5, 1.0, 'Simple Grouped Barplot')
# I like these! This graph shows the same as above, but I think it makes the relationship between avg price, month and cancellation status easier to understand.
# plot data
fig, ax = plt.subplots(figsize=(15,7))
# select the price column before aggregating, then use unstack()
data.groupby(['month','booking_status'])['avg_price_per_room'].mean().unstack().plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c062d310>
# Although not perfect, there is a general relationship between lead time and avg price.
# plot data
fig, ax = plt.subplots(figsize=(15,7))
# select the price column before aggregating, then use unstack()
data.groupby(['lead_time','booking_status'])['avg_price_per_room'].mean().unstack().plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c4e8a390>
# Again this is simple to understand: canceled or not, by room type. We don't know what each type means, as the codes are provided to us by the company. Let's assume the higher the type number, the better (and pricier) the room; those tend to get canceled more than super budget rooms.
# plot data
fig, ax = plt.subplots(figsize=(15,7))
# select the price column before aggregating, then use unstack()
data.groupby(['room_type_reserved','booking_status'])['avg_price_per_room'].mean().unstack().plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c251d610>
# This shows a very big difference in canceled rooms for people that select Meal Plan 3 – Full board (breakfast, lunch, and dinner)...that is likely a factor in the higher avg price.
# Note that only 5 bookings use Meal Plan 3, so this difference rests on very few observations.
# plot data
fig, ax = plt.subplots(figsize=(15, 7))
# select the price column before aggregating, then use unstack()
data.groupby(["type_of_meal_plan", "booking_status"])[
    "avg_price_per_room"
].mean().unstack().plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7f45c0ea0a10>
#This confirms what we have seen ... Online is easy to book and cancel.
data.market_segment_type.hist(by=data.booking_status)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c047bd50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c0d75b50>],
dtype=object)
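# Room type 1 has the largest number of cancellations; it is also by far the most-booked room type, so these counts alone do not show a higher cancellation rate.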
data.room_type_reserved.hist(by=data.booking_status)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c04240d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c0d20a10>],
dtype=object)
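The histograms above show counts, which are dominated by the largest categories. To compare categories fairly, the share of canceled bookings within each category is more informative. The sketch below is one way to compute it; the helper name `cancellation_rate` and the frame `toy` are hypothetical.

```python
import pandas as pd


def cancellation_rate(df, by):
    """Bookings and percentage canceled for each value of column `by`."""
    is_canceled = df["booking_status"].eq("Canceled")
    out = is_canceled.groupby(df[by]).agg(bookings="size", cancel_rate="mean")
    out["cancel_rate"] = (out["cancel_rate"] * 100).round(2)
    return out.sort_values("cancel_rate", ascending=False)


# Hypothetical toy data: the larger category has the lower rate
toy = pd.DataFrame(
    {
        "room_type_reserved": ["Room_Type 1"] * 4 + ["Room_Type 4"] * 2,
        "booking_status": [
            "Canceled",
            "Not_Canceled",
            "Not_Canceled",
            "Not_Canceled",
            "Canceled",
            "Not_Canceled",
        ],
    }
)
rates = cancellation_rate(toy, "room_type_reserved")
print(rates)
```

The same helper could be applied to `market_segment_type` or `month` on the real data; keeping the `bookings` column next to the rate makes it easy to spot rates that rest on very few rows.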
# Same story ... cancellations rise throughout the months, and start off very low in Jan when people don't travel.
data.month.hist(by=data.booking_status)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c0ed84d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f45c17d7cd0>],
dtype=object)
Questions:
data.groupby(["month"]).size().reset_index(name='counts')
| | month | counts |
|---|---|---|
| 0 | 1.00 | 1831 |
| 1 | 2.00 | 2445 |
| 2 | 3.00 | 2595 |
| 3 | 4.00 | 2982 |
| 4 | 5.00 | 2778 |
| 5 | 6.00 | 3209 |
| 6 | 7.00 | 2811 |
| 7 | 8.00 | 3631 |
| 8 | 9.00 | 4116 |
| 9 | 10.00 | 4209 |
| 10 | 11.00 | 2655 |
| 11 | 12.00 | 2976 |
data.groupby(["market_segment_type"]).size().reset_index(name='counts')
| | market_segment_type | counts |
|---|---|---|
| 0 | Aviation | 125 |
| 1 | Complementary | 390 |
| 2 | Corporate | 2011 |
| 3 | Offline | 10518 |
| 4 | Online | 23194 |
data.groupby(["repeated_guest"]).size().reset_index(name='counts')
| | repeated_guest | counts |
|---|---|---|
| 0 | 0 | 35312 |
| 1 | 1 | 926 |
pd.pivot_table(data, columns=['market_segment_type'], aggfunc='mean').style
| market_segment_type | Aviation | Complementary | Corporate | Offline | Online |
|---|---|---|---|---|---|
| avg_price_per_room | 100.704000 | 93.911359 | 82.940318 | 91.642252 | 113.087858 |
| lead_time | 10.856000 | 31.930769 | 28.635505 | 123.763833 | 77.435932 |
| month | 7.576000 | 7.012821 | 6.695674 | 6.998954 | 6.966974 |
| no_of_adults | 1.016000 | 1.484615 | 1.230731 | 1.778095 | 1.939596 |
| no_of_children | 0.000000 | 0.125641 | 0.009945 | 0.021012 | 0.151893 |
| no_of_previous_bookings_not_canceled | 0.208000 | 2.482051 | 2.065639 | 0.010839 | 0.012115 |
| no_of_previous_cancellations | 0.040000 | 0.210256 | 0.167081 | 0.011124 | 0.013193 |
| no_of_special_requests | 0.000000 | 0.884615 | 0.221780 | 0.202795 | 0.842545 |
| no_of_week_nights | 2.856000 | 1.241026 | 1.488314 | 2.181118 | 2.289428 |
| no_of_weekend_nights | 1.160000 | 0.328205 | 0.426156 | 0.730272 | 0.886393 |
| repeated_guest | 0.128000 | 0.323077 | 0.297862 | 0.008557 | 0.004096 |
| required_car_parking_space | 0.048000 | 0.079487 | 0.090502 | 0.003233 | 0.037423 |
| week_of_year | 31.000000 | 28.612821 | 27.338140 | 28.562274 | 28.447616 |
pd.pivot_table(data, columns=['repeated_guest', 'booking_status'], aggfunc='size').to_frame(name='value').style
| repeated_guest | booking_status | value |
|---|---|---|
| 0 | Canceled | 11863 |
| | Not_Canceled | 23449 |
| 1 | Canceled | 15 |
| | Not_Canceled | 911 |
# populate the list of numeric attributes and categorical attributes
num_list = []
cat_list = []
for column in data:
    if is_numeric_dtype(data[column]):
        num_list.append(column)
    elif is_string_dtype(data[column]):
        cat_list.append(column)
print(num_list)
print(cat_list)
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'month', 'week_of_year']
['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']
data[num_list].hist(figsize=(15, 15))
# set a large figsize if you have > 9 variables
plt.tight_layout()
plt.show()
# There are several variables that are skewed - no of weekend nights, lead time, # of special requests that may need to be treated as outliers.
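Skew and outliers can also be put into numbers rather than judged from the histograms alone. The sketch below reports the sample skewness and the count of values outside the usual 1.5 x IQR fences for each column. The helper name `skew_and_outliers` and the frame `toy` are hypothetical; on the real data it could be called with `data` and `num_list`.

```python
import pandas as pd


def skew_and_outliers(df, cols):
    """Skewness and number of 1.5*IQR outliers for each numeric column."""
    rows = {}
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        rows[col] = {"skew": df[col].skew(), "iqr_outliers": int(outside.sum())}
    return pd.DataFrame.from_dict(rows, orient="index")


# Hypothetical toy data: one column with a long right tail, one symmetric
toy = pd.DataFrame(
    {"lead_time": [1, 2, 3, 4, 100], "no_of_adults": [1, 2, 2, 2, 3]}
)
report = skew_and_outliers(toy, ["lead_time", "no_of_adults"])
print(report)
```

The 1.5 x IQR rule is only a convention; for count-like columns with many zeros it can flag a large share of perfectly valid rows.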
# This helper function creates a combined boxplot and histogram for a numerical column.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # For histogram: only pass bins when a value was given
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# Lets check which variables are numeric and which ones are categorical.
# There are float64(4), int64(9), object(4)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36238 entries, 0 to 36274
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36238 non-null  int64
 1   no_of_children                        36238 non-null  int64
 2   no_of_weekend_nights                  36238 non-null  int64
 3   no_of_week_nights                     36238 non-null  int64
 4   type_of_meal_plan                     36238 non-null  object
 5   required_car_parking_space            36238 non-null  int64
 6   room_type_reserved                    36238 non-null  object
 7   lead_time                             36238 non-null  float64
 8   market_segment_type                   36238 non-null  object
 9   repeated_guest                        36238 non-null  int64
 10  no_of_previous_cancellations          36238 non-null  int64
 11  no_of_previous_bookings_not_canceled  36238 non-null  int64
 12  avg_price_per_room                    36238 non-null  float64
 13  no_of_special_requests                36238 non-null  int64
 14  booking_status                        36238 non-null  object
 15  month                                 36238 non-null  float64
 16  week_of_year                          36238 non-null  float64
dtypes: float64(4), int64(9), object(4)
memory usage: 6.2+ MB
sns.__version__
'0.11.2'
histogram_boxplot(data, "no_of_adults")
There are about 2 adults per reservation on average.
histogram_boxplot(data, "no_of_children")
histogram_boxplot(data, "no_of_weekend_nights")
histogram_boxplot(data, "no_of_week_nights")
histogram_boxplot(data, "required_car_parking_space")
histogram_boxplot(data, "lead_time")
Most reservations are booked fewer than 100 days out; however, quite a few are booked more than 200 days out.
histogram_boxplot(data, "no_of_special_requests")
Many reservations include 1 special request.
histogram_boxplot(data, "week_of_year")
histogram_boxplot(data, "repeated_guest")
Most reservations are not repeat customers.
histogram_boxplot(data, "no_of_special_requests")
Most reservations do not have any special requests.
histogram_boxplot(data, "month")
Starting in August, the hotels are busy. October is the highest-volume month.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(ascending=True),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
labeled_barplot(data, "booking_status", perc=True)
67% of reservations are not canceled.
labeled_barplot(data, "market_segment_type", perc=True)
About two-thirds (64%) of all reservations are made online.
labeled_barplot(data, "room_type_reserved", perc=True)
labeled_barplot(data, "type_of_meal_plan", perc=True)
Most people include breakfast, the most important meal of the day, in their reservation.
labeled_barplot(data, "no_of_previous_cancellations", perc=True, n=20)
In 99% of all reservations, the person making the reservation has not canceled in the past.
labeled_barplot(data, "no_of_previous_bookings_not_canceled", perc=True, n=20)
labeled_barplot(data, "month", perc=True, n=20)
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
distribution_plot_wrt_target(data, "no_of_weekend_nights", "booking_status")
Reservations with more weekend nights get canceled more often.
labeled_barplot(data, "room_type_reserved", perc=True)
# Room Type 1 is selected most often, as seen above. The canceled/not-canceled split appears to be about the same.
distribution_plot_wrt_target(data, "week_of_year", "booking_status")
distribution_plot_wrt_target(data, "required_car_parking_space", "booking_status")
Reservations that don't require a parking spot get canceled more. This may indicate that guests who plan to drive put more thought into the trip and are less likely to cancel.
distribution_plot_wrt_target(data, "no_of_special_requests", "booking_status")
Canceled reservations have fewer special requests than those not canceled, which may reflect how much advance planning went into the booking.
distribution_plot_wrt_target(data, "week_of_year", "booking_status")
The cancellation rate rises as the weeks progress, especially around week 40.
distribution_plot_wrt_target(data, "required_car_parking_space", "booking_status")
Reservations that do not require parking get canceled more than those that do.
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(data, "type_of_meal_plan", "booking_status")
| type_of_meal_plan | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11878 | 24360 | 36238 |
| Meal Plan 1 | 8673 | 19129 | 27802 |
| Not Selected | 1698 | 3431 | 5129 |
| Meal Plan 2 | 1506 | 1796 | 3302 |
| Meal Plan 3 | 1 | 4 | 5 |
stacked_barplot(data, "room_type_reserved", "booking_status")
| room_type_reserved | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11878 | 24360 | 36238 |
| Room_Type 1 | 9066 | 19039 | 28105 |
| Room_Type 4 | 2068 | 3981 | 6049 |
| Room_Type 6 | 406 | 558 | 964 |
| Room_Type 2 | 228 | 464 | 692 |
| Room_Type 5 | 72 | 191 | 263 |
| Room_Type 7 | 36 | 122 | 158 |
| Room_Type 3 | 2 | 5 | 7 |
Room Type 6 has the highest cancelation rate.
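The printed crosstab gives counts, so the rate behind this observation is easy to check by hand. The sketch below copies the room-type counts from the output above into a small frame and divides canceled by total; the frame and variable names (`counts`, `rates`) are illustrative only.

```python
import pandas as pd

# Counts copied from the room_type_reserved crosstab printed above
counts = pd.DataFrame(
    {
        "Canceled": [9066, 2068, 406, 228, 72, 36, 2],
        "Not_Canceled": [19039, 3981, 558, 464, 191, 122, 5],
    },
    index=[
        "Room_Type 1",
        "Room_Type 4",
        "Room_Type 6",
        "Room_Type 2",
        "Room_Type 5",
        "Room_Type 7",
        "Room_Type 3",
    ],
)

# Share of bookings canceled within each room type
totals = counts["Canceled"] + counts["Not_Canceled"]
rates = (counts["Canceled"] / totals).sort_values(ascending=False)
print(rates.round(3))
```

On the full data the same rates come from `pd.crosstab(..., normalize="index")`, which is what `stacked_barplot` plots. Rates for very small groups, such as Room_Type 3 with 7 bookings, should be read with caution.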
stacked_barplot(data, "market_segment_type", "booking_status")
| market_segment_type | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11878 | 24360 | 36238 |
| Online | 8469 | 14725 | 23194 |
| Offline | 3152 | 7366 | 10518 |
| Corporate | 220 | 1791 | 2011 |
| Aviation | 37 | 88 | 125 |
| Complementary | 0 | 390 | 390 |
Online reservations are canceled most frequently (about 37%), followed by Offline and Aviation, which are about equal (about 30% each).
stacked_barplot(data, "no_of_special_requests", "booking_status")
| no_of_special_requests | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11878 | 24360 | 36238 |
| 0 | 8540 | 11211 | 19751 |
| 1 | 2701 | 8662 | 11363 |
| 2 | 637 | 3726 | 4363 |
| 3 | 0 | 675 | 675 |
| 4 | 0 | 78 | 78 |
| 5 | 0 | 8 | 8 |
Reservations that have no special requests are canceled at roughly twice the rate of reservations with one or more requests (about 43% vs. 20%).
stacked_barplot(data, "no_of_previous_cancellations", "booking_status")
| no_of_previous_cancellations | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11878 | 24360 | 36238 |
| 0 | 11863 | 24038 | 35901 |
| 1 | 10 | 187 | 197 |
| 13 | 4 | 0 | 4 |
| 3 | 1 | 42 | 43 |
| 2 | 0 | 46 | 46 |
| 4 | 0 | 10 | 10 |
| 5 | 0 | 11 | 11 |
| 6 | 0 | 1 | 1 |
| 11 | 0 | 25 | 25 |
All 4 reservations with 13 prior cancellations were canceled. Among reservations with no prior cancellations, about 33% were canceled.
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Another view of the prior matrix with a different color scheme to highlight the variables that are correlated.
NOTE: We will have to drop either month or week_of_year because they are almost perfectly correlated.
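Reading near-duplicate features off a heatmap works for a dozen columns, but the same check can be done programmatically. The sketch below lists every column pair whose absolute correlation exceeds a cutoff. The data is synthetic (a `week_of_year` column derived from `month` plus noise), and the 0.9 cutoff is an assumption chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic example: week_of_year is derived from month, so the two move together
rng = np.random.default_rng(0)
month = rng.integers(1, 13, size=2000)
demo = pd.DataFrame(
    {
        "month": month,
        "week_of_year": month * 4 + rng.integers(-3, 1, size=2000),
        "lead_time": rng.exponential(scale=85, size=2000),
    }
)

corr = demo.corr().abs()

# Visit each unordered column pair once and keep those above the assumed cutoff
cutoff = 0.9
high_pairs = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > cutoff
}
print(high_pairs)
```

Only one column from each flagged pair needs to go; which one to keep is a modeling choice.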
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| required_car_parking_space | 36238.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| repeated_guest | 36238.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| no_of_special_requests | 36238.00 | 0.62 | 0.79 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| month | 36238.00 | 6.96 | 3.26 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
labeled_barplot(data, "week_of_year", perc=True, n=20)
labeled_barplot(data, "no_of_previous_cancellations", perc=True, n=20)
labeled_barplot(data, "no_of_previous_bookings_not_canceled", perc=True, n=20)
labeled_barplot(data, "avg_price_per_room", perc=True, n=20)
labeled_barplot(data, "lead_time", perc=True, n=20, )
sns.pairplot(data, hue="booking_status")
plt.show()
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Drop the categorical, binary and special-request attributes, then summarize what is left (describe() skips the non-numeric booking_status)
df_num = data.drop(
['type_of_meal_plan', 'required_car_parking_space',
'room_type_reserved', 'market_segment_type', 'repeated_guest', 'no_of_special_requests'],
axis=1,
).describe()
df_num.T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| month | 36238.00 | 6.96 | 3.26 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
#Outlier Detection
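# Note: the z-score check below reports no outliers. df_num holds the describe() summary (8 rows per column), not the raw observations, and with only 8 values an absolute z-score (ddof=0) cannot exceed sqrt(7) ≈ 2.65, so the > 3 rule can never fire here. I am conflicted, but will proceed with outlier treatment.
from scipy.stats import zscore
# Compute absolute z-scores (stored under a new name so the zscore function is not overwritten)
abs_z = np.abs(zscore(df_num))
df_zscores = pd.DataFrame(data=abs_z, columns=df_num.columns)
# Count z-scores greater than 3 for each feature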
outliers = pd.DataFrame(
df_zscores[df_zscores > 3].count(), columns=["Number of Outliers"]
)
# Calculate percentage make up in each feature
outliers["Percentage in Feature"] = (
outliers["Number of Outliers"].apply(lambda x: x / data.shape[0] * 100).round(2)
)
outliers = outliers.reset_index()
outliers
| | index | Number of Outliers | Percentage in Feature |
|---|---|---|---|
| 0 | no_of_adults | 0 | 0.00 |
| 1 | no_of_children | 0 | 0.00 |
| 2 | no_of_weekend_nights | 0 | 0.00 |
| 3 | no_of_week_nights | 0 | 0.00 |
| 4 | lead_time | 0 | 0.00 |
| 5 | no_of_previous_cancellations | 0 | 0.00 |
| 6 | no_of_previous_bookings_not_canceled | 0 | 0.00 |
| 7 | avg_price_per_room | 0 | 0.00 |
| 8 | month | 0 | 0.00 |
| 9 | week_of_year | 0 | 0.00 |
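The all-zero table is worth a second look. `df_num` is built with `.describe()`, so each column holds 8 summary statistics rather than the raw observations. With population standard deviation (`ddof=0`, the `scipy.stats.zscore` default), no value among n can sit more than sqrt(n - 1) standard deviations from the mean, which for n = 8 is about 2.65. The sketch below contrasts the two computations on synthetic data; the `raw` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic stand-in for one numeric column (values are illustrative only)
rng = np.random.default_rng(42)
raw = pd.DataFrame({"lead_time": rng.exponential(scale=85, size=10_000)})

# z-scores over the describe() summary: only 8 values per column
summary = raw.describe()
z_summary = np.abs(zscore(summary.to_numpy(), axis=0))
n_summary_flags = int((z_summary > 3).sum())
print("flags on summary table:", n_summary_flags)

# z-scores over the raw observations
z_raw = np.abs(zscore(raw.to_numpy(), axis=0))
n_outliers = int((z_raw > 3).sum())
print("flags on raw data:", n_outliers, f"({100 * n_outliers / len(raw):.2f}%)")
```

On the notebook's data the equivalent call would pass the raw numeric columns of `data` to `zscore` instead of `df_num`; the counts that produces are not shown here because they were not part of the original run.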
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
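Both helpers take a `threshold` argument, and the choice matters: raising it means fewer bookings are flagged as cancellations, so recall can only stay the same or fall, while precision usually (but not always) rises. The sketch below illustrates this on synthetic labels and hypothetical predicted probabilities; none of the numbers come from the hotel data, and `recall_precision` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
y_true = (rng.random(2000) < 0.33).astype(int)
# Hypothetical predicted probabilities: noisy, but higher on average for class 1
proba = np.clip(0.3 + 0.3 * y_true + rng.normal(scale=0.2, size=2000), 0, 1)


def recall_precision(y, p, threshold):
    """Classify as 1 when p > threshold, then return (recall, precision)."""
    pred = (p > threshold).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


results = {t: recall_precision(y_true, proba, t) for t in (0.3, 0.5, 0.7)}
for t, (rec, prec) in results.items():
    print(f"threshold={t}: recall={rec:.3f}, precision={prec:.3f}")
```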
# Month and Week are ~99% correlated so we need to drop 1
data.drop(['month'], axis = 1, inplace = True)
# After the preprocessing steps we end up with dtypes: float64(3), int64(9), object(4)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36238 entries, 0 to 36274
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36238 non-null  int64
 1   no_of_children                        36238 non-null  int64
 2   no_of_weekend_nights                  36238 non-null  int64
 3   no_of_week_nights                     36238 non-null  int64
 4   type_of_meal_plan                     36238 non-null  object
 5   required_car_parking_space            36238 non-null  int64
 6   room_type_reserved                    36238 non-null  object
 7   lead_time                             36238 non-null  float64
 8   market_segment_type                   36238 non-null  object
 9   repeated_guest                        36238 non-null  int64
 10  no_of_previous_cancellations          36238 non-null  int64
 11  no_of_previous_bookings_not_canceled  36238 non-null  int64
 12  avg_price_per_room                    36238 non-null  float64
 13  no_of_special_requests                36238 non-null  int64
 14  booking_status                        36238 non-null  object
 15  week_of_year                          36238 non-null  float64
dtypes: float64(3), int64(9), object(4)
memory usage: 6.0+ MB
# Let's check the cancelation ratio to make sure nothing has changed while we were doing the data changes.
n_true = len(data.loc[data["booking_status"] == 'Canceled'])
n_false = len(data.loc[data["booking_status"] == 'Not_Canceled'])
print(
"Number of canceled reservations: {0} ({1:2.2f}%)".format(
n_true, (n_true / (n_true + n_false)) * 100
)
)
print(
"Number of reservations not canceled: {0} ({1:2.2f}%)".format(
n_false, (n_false / (n_true + n_false)) * 100
)
)
Number of canceled reservations: 11878 (32.78%)
Number of reservations not canceled: 24360 (67.22%)
numerical_col
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'month', 'week_of_year']
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
#Let's keep checking data types
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| required_car_parking_space | 36238.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| repeated_guest | 36238.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| no_of_special_requests | 36238.00 | 0.62 | 0.79 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# Following the outlier treatment, there are a number of variables that should be dropped (left commented out for now)
#data.drop(['no_of_previous_cancellations'], axis = 1, inplace = True)
#data.drop(['no_of_previous_bookings_not_canceled'], axis = 1, inplace = True)
#data.drop(['no_of_adults'], axis = 1, inplace = True)
#data.drop(['no_of_children'], axis = 1, inplace = True)
# Encoding and replacing the words ' Not_Canceled and Canceled' with 0 and 1 to match required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes) convention
data["booking_status"] = data["booking_status"].replace("Not_Canceled", 0)
data["booking_status"] = data["booking_status"].replace("Canceled", 1)
# Meal plan recoding, kept for reference only. These replacements appear not to have been applied: the dummy columns created below still include "type_of_meal_plan_Not Selected".
# data["type_of_meal_plan"] = data["type_of_meal_plan"].replace("Meal Plan ", "")
# data["type_of_meal_plan"] = data["type_of_meal_plan"].replace("Not Selected", 0)
### I'm going to create a copy of the data at this point and use it for the decision tree analysis. The copy is taken before the data gets treated for outliers, so that all the prior changes are kept and the outlier adjustments can be applied separately.
data2 = data.copy()
#Creating training and test sets.
X = data.drop("booking_status", axis=1)
Y = data["booking_status"]
# creating dummy variables ... this function will create dummies for both Objects & Categories and we are dropping the first column bc all the information is present with the others.
X = pd.get_dummies(X, columns=X.select_dtypes(include=['object', 'category']).columns.tolist(), drop_first=True)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25366, 25)
Shape of test set :  (10872, 25)
Percentage of classes in training set:
0   0.67
1   0.33
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.68
1   0.32
Name: booking_status, dtype: float64
# The class proportions in the training and test sets are almost equal (0.67 vs 0.68 not canceled), so we can proceed with the model as the distributions are comparable.
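The split above happens to preserve the class balance by chance. If a guarantee is wanted, `train_test_split` accepts a `stratify` argument that forces both splits to carry the same class proportions. This is a sketch of that option on synthetic data, not what the notebook ran; `X_demo`, `y_demo` and the other names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with roughly the same balance as the bookings (about 33% positives)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"lead_time": rng.exponential(scale=85, size=3000)})
y_demo = pd.Series((rng.random(3000) < 0.33).astype(int), name="booking_status")

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print("share of class 1 in train:", round(y_tr.mean(), 4))
print("share of class 1 in test: ", round(y_te.mean(), 4))
```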
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
model = LogisticRegression(solver="newton-cg", random_state=1)
lg = model.fit(X_train, y_train)
# predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision:", precision_score(y_train, y_pred_train))
print("Recall:", recall_score(y_train, y_pred_train))
print("F1:", f1_score(y_train, y_pred_train))
Training set performance:
Accuracy: 0.7980367420957187
Precision: 0.7341551849166063
Recall: 0.6060823754789272
F1: 0.6639994753066176
Accuracy is acceptable at ~0.80, but recall is only ~0.61, so the model misses about 39% of the actual cancellations. More work can be done.
# predicting on the test set
y_pred_test = lg.predict(X_test)
print("Test set performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
print("F1:", f1_score(y_test, y_pred_test))
Test set performance:
Accuracy: 0.7987490802060339
Precision: 0.7162249515190692
Recall: 0.6284741917186614
F1: 0.6694864048338369
The test results are almost the same: accuracy is ~0.80 and recall is ~0.63, so the gap is between accuracy and recall rather than between train and test. More work can be done on recall.
Recall on the train and test sets are comparable.
This shows that the model is giving a generalised result.
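As a quick sanity check on the printed metrics, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two numbers. The sketch below does that for both splits using the values printed above, and also reports the share of actual cancellations the model misses (1 minus recall). The helper `f1_from` is written for this example.

```python
# Precision, recall and F1 as printed above for the sklearn model
reported = {
    "train": {
        "precision": 0.7341551849166063,
        "recall": 0.6060823754789272,
        "f1": 0.6639994753066176,
    },
    "test": {
        "precision": 0.7162249515190692,
        "recall": 0.6284741917186614,
        "f1": 0.6694864048338369,
    },
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for split, m in reported.items():
    f1 = f1_from(m["precision"], m["recall"])
    miss_rate = 1 - m["recall"]  # share of actual cancellations not flagged
    print(f"{split}: F1={f1:.4f} (reported {m['f1']:.4f}), missed={miss_rate:.1%}")
```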
# In this model the dependent variable is booking_status, so we need to drop it from the predictors.
X = data.drop("booking_status", axis=1)
Y = data["booking_status"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# adding constant
X = sm.add_constant(X)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
X.head(5)
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | week_of_year | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00 | 2 | 0 | 1 | 2 | 0 | 224.00 | 0 | 0 | 0 | 65.00 | 0 | 6.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.00 | 2 | 0 | 2 | 3 | 0 | 5.00 | 0 | 0 | 0 | 106.68 | 1 | 24.00 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1.00 | 1 | 0 | 2 | 1 | 0 | 1.00 | 0 | 0 | 0 | 60.00 | 0 | 9.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1.00 | 2 | 0 | 0 | 2 | 0 | 211.00 | 0 | 0 | 0 | 100.00 | 0 | 20.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1.00 | 2 | 0 | 1 | 1 | 0 | 48.00 | 0 | 0 | 0 | 94.50 | 0 | 44.00 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25366
Model: Logit Df Residuals: 25340
Method: MLE Df Model: 25
Date: Fri, 19 Nov 2021 Pseudo R-squ.: 0.3165
Time: 22:56:24 Log-Likelihood: -10986.
converged: False LL-Null: -16073.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -2.7129 0.253 -10.721 0.000 -3.209 -2.217
no_of_adults 0.1062 0.037 2.863 0.004 0.033 0.179
no_of_children 0.2652 0.059 4.522 0.000 0.150 0.380
no_of_weekend_nights 0.1595 0.020 8.142 0.000 0.121 0.198
no_of_week_nights 0.0311 0.012 2.583 0.010 0.007 0.055
required_car_parking_space -1.5030 0.132 -11.383 0.000 -1.762 -1.244
lead_time 0.0156 0.000 62.075 0.000 0.015 0.016
repeated_guest -3.0909 0.620 -4.985 0.000 -4.306 -1.876
no_of_previous_cancellations 0.3082 0.075 4.089 0.000 0.160 0.456
no_of_previous_bookings_not_canceled -0.0074 0.061 -0.122 0.903 -0.126 0.111
avg_price_per_room 0.0179 0.001 24.881 0.000 0.017 0.019
no_of_special_requests -1.4660 0.030 -49.364 0.000 -1.524 -1.408
week_of_year -0.0037 0.001 -3.076 0.002 -0.006 -0.001
type_of_meal_plan_Meal Plan 2 0.0992 0.063 1.567 0.117 -0.025 0.223
type_of_meal_plan_Meal Plan 3 -11.4274 383.312 -0.030 0.976 -762.705 739.850
type_of_meal_plan_Not Selected 0.2756 0.052 5.288 0.000 0.173 0.378
room_type_reserved_Room_Type 2 -0.5085 0.132 -3.855 0.000 -0.767 -0.250
room_type_reserved_Room_Type 3 -0.0495 1.211 -0.041 0.967 -2.423 2.324
room_type_reserved_Room_Type 4 -0.1623 0.053 -3.084 0.002 -0.265 -0.059
room_type_reserved_Room_Type 5 -0.5010 0.206 -2.429 0.015 -0.905 -0.097
room_type_reserved_Room_Type 6 -1.0058 0.149 -6.743 0.000 -1.298 -0.713
room_type_reserved_Room_Type 7 -1.2870 0.308 -4.178 0.000 -1.891 -0.683
market_segment_type_Complementary -54.6269 6.39e+10 -8.55e-10 1.000 -1.25e+11 1.25e+11
market_segment_type_Corporate -1.1388 0.255 -4.467 0.000 -1.639 -0.639
market_segment_type_Offline -2.0814 0.243 -8.554 0.000 -2.558 -1.604
market_segment_type_Online -0.3463 0.240 -1.441 0.149 -0.817 0.125
========================================================================================================
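It ran! After all of those 'singular matrix' errors, I simplified the model, went back to the beginning, and the work was worth it. Based on the initial results, some p-values are > 0.05 and will need closer inspection. Note also that the summary reports converged: False, and that type_of_meal_plan_Meal Plan 3 and market_segment_type_Complementary have very large standard errors; the crosstabs above show that Meal Plan 3 has only 5 bookings and that Complementary has no canceled bookings.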
X_train.head(5)
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | week_of_year | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4124 | 1.00 | 2 | 0 | 0 | 1 | 0 | 289.00 | 0 | 0 | 0 | 67.00 | 0 | 42.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 31349 | 1.00 | 3 | 0 | 0 | 4 | 0 | 107.00 | 0 | 0 | 0 | 152.10 | 0 | 34.00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10601 | 1.00 | 2 | 0 | 0 | 1 | 0 | 4.00 | 0 | 0 | 0 | 90.00 | 0 | 7.00 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26095 | 1.00 | 2 | 0 | 0 | 4 | 0 | 52.00 | 0 | 0 | 0 | 63.75 | 0 | 36.00 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9017 | 1.00 | 2 | 0 | 1 | 2 | 0 | 142.00 | 0 | 0 | 0 | 125.33 | 1 | 36.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Observations
A negative coefficient means the probability of the customer canceling decreases as the corresponding attribute value increases.
A positive coefficient means the probability of the customer canceling increases as the corresponding attribute value increases.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
We will have to remove multicollinearity from the data to get reliable coefficients and p-values.
There are different ways of detecting (or testing) multi-collinearity, one such way is the Variation Inflation Factor.
Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearity that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.
General Rule of thumb: If VIF is 1 then there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. Whereas if VIF exceeds 5, we say there is moderate VIF and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. But the purpose of the analysis should dictate which threshold to use.
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection:

const                                  303.79
no_of_adults                             1.33
no_of_children                           2.00
no_of_weekend_nights                     1.07
no_of_week_nights                        1.10
required_car_parking_space               1.03
lead_time                                1.20
repeated_guest                           1.80
no_of_previous_cancellations             1.28
no_of_previous_bookings_not_canceled     1.56
avg_price_per_room                       1.83
no_of_special_requests                   1.24
week_of_year                             1.02
type_of_meal_plan_Meal Plan 2            1.22
type_of_meal_plan_Meal Plan 3            1.03
type_of_meal_plan_Not Selected           1.24
room_type_reserved_Room_Type 2           1.10
room_type_reserved_Room_Type 3           1.00
room_type_reserved_Room_Type 4           1.35
room_type_reserved_Room_Type 5           1.03
room_type_reserved_Room_Type 6           2.01
room_type_reserved_Room_Type 7           1.12
market_segment_type_Complementary        4.02
market_segment_type_Corporate           15.62
market_segment_type_Offline             58.96
market_segment_type_Online              65.37
dtype: float64
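The VIF definition can be made concrete with a small, dependency-free sketch. It relies on the standard identity VIF_k = 1 / (1 - R_k²), where R_k² comes from regressing predictor k on the remaining predictors; with only two predictors, R_k² equals the squared Pearson correlation between them. The function names and toy data below are illustrative assumptions and are not part of the notebook.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical toy data: one uncorrelated pair and one nearly collinear pair
uncorrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
nearly_collinear = vif_two_predictors([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(uncorrelated, 2), round(nearly_collinear, 2))
```

The uncorrelated pair yields a VIF of 1, while the nearly collinear pair lands far above the rule-of-thumb cutoff of 10.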
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    X_train_aux = X_train[cols]
    # fitting the model
    model = sm.Logit(y_train, X_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    # drop the least significant feature, or stop once all are significant
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'week_of_year', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Offline', 'market_segment_type_Online']
# creating a new training set
X_train3 = X_train[
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time', 'repeated_guest', 'no_of_previous_cancellations',
'avg_price_per_room', 'no_of_special_requests', 'week_of_year',
'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2',
'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5',
'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7',
'market_segment_type_Offline', 'market_segment_type_Online']
].astype(float)
logit3 = sm.Logit(y_train, X_train3)
lg3 = logit3.fit(disp=False)
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25366
Model: Logit Df Residuals: 25346
Method: MLE Df Model: 19
Date: Fri, 19 Nov 2021 Pseudo R-squ.: 0.3148
Time: 22:56:27 Log-Likelihood: -11014.
converged: True LL-Null: -16073.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -3.9163 0.121 -32.379 0.000 -4.153 -3.679
no_of_adults 0.0919 0.037 2.491 0.013 0.020 0.164
no_of_children 0.2598 0.058 4.446 0.000 0.145 0.374
no_of_weekend_nights 0.1653 0.020 8.456 0.000 0.127 0.204
no_of_week_nights 0.0357 0.012 2.980 0.003 0.012 0.059
required_car_parking_space -1.4982 0.132 -11.347 0.000 -1.757 -1.239
lead_time 0.0156 0.000 62.486 0.000 0.015 0.016
repeated_guest -3.0949 0.577 -5.362 0.000 -4.226 -1.964
no_of_previous_cancellations 0.3061 0.075 4.062 0.000 0.158 0.454
avg_price_per_room 0.0182 0.001 26.020 0.000 0.017 0.020
no_of_special_requests -1.4674 0.030 -49.504 0.000 -1.525 -1.409
week_of_year -0.0036 0.001 -2.968 0.003 -0.006 -0.001
type_of_meal_plan_Not Selected 0.2779 0.052 5.342 0.000 0.176 0.380
room_type_reserved_Room_Type 2 -0.5137 0.132 -3.903 0.000 -0.772 -0.256
room_type_reserved_Room_Type 4 -0.1522 0.052 -2.914 0.004 -0.255 -0.050
room_type_reserved_Room_Type 5 -0.5289 0.204 -2.591 0.010 -0.929 -0.129
room_type_reserved_Room_Type 6 -1.0189 0.148 -6.863 0.000 -1.310 -0.728
room_type_reserved_Room_Type 7 -1.3194 0.307 -4.303 0.000 -1.920 -0.718
market_segment_type_Offline -0.8737 0.096 -9.139 0.000 -1.061 -0.686
market_segment_type_Online 0.8321 0.093 8.986 0.000 0.651 1.014
==================================================================================================
X_train3.head(2)
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | week_of_year | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4124 | 1.00 | 2.00 | 0.00 | 0.00 | 1.00 | 0.00 | 289.00 | 0.00 | 0.00 | 67.00 | 0.00 | 42.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| 31349 | 1.00 | 3.00 | 0.00 | 0.00 | 4.00 | 0.00 | 107.00 | 0.00 | 0.00 | 152.10 | 0.00 | 34.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
No feature now has a p-value greater than 0.05, so we take the features in X_train3 as the final set and lg3 as the final model.
# converting coefficients to odds
odds = np.exp(lg3.params)
# finding the percentage change
perc_change_odds = (np.exp(lg3.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | week_of_year | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.02 | 1.10 | 1.30 | 1.18 | 1.04 | 0.22 | 1.02 | 0.05 | 1.36 | 1.02 | 0.23 | 1.00 | 1.32 | 0.60 | 0.86 | 0.59 | 0.36 | 0.27 | 0.42 | 2.30 |
| Change_odd% | -98.01 | 9.62 | 29.67 | 17.98 | 3.64 | -77.65 | 1.58 | -95.47 | 35.82 | 1.84 | -76.95 | -0.36 | 32.04 | -40.17 | -14.12 | -41.07 | -63.90 | -73.27 | -58.26 | 129.82 |
Interpretation for other attributes can be done similarly.
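As a worked illustration of the coefficient-to-odds conversion, the sketch below applies exp(coef) and (exp(coef) - 1) * 100 to two coefficients copied from the lg3 summary. Because the summary prints rounded coefficients, the results can differ from the table in the last digit. The helper names are hypothetical.

```python
import math


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)


def percent_change_in_odds(coef):
    """Percentage change in the odds for a one-unit increase in the feature."""
    return (math.exp(coef) - 1) * 100


# Coefficients as printed (rounded) in the lg3 summary
lead_time_coef = 0.0156
special_requests_coef = -1.4674

lead_time_pct = percent_change_in_odds(lead_time_coef)
special_requests_pct = percent_change_in_odds(special_requests_coef)
print(f"lead_time: {lead_time_pct:.2f}% change in odds per one-unit increase")
print(f"no_of_special_requests: {special_requests_pct:.2f}% change in odds per one-unit increase")
```

Read this way, and holding the other features constant, each additional unit of lead_time raises the odds of cancellation by roughly 1.6%, while each additional special request lowers them by roughly 77%.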
# creating confusion matrix
confusion_matrix_statsmodels(lg3, X_train3, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg3, X_train3, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80 | 0.61 | 0.73 | 0.67 |
logit_roc_auc_train = roc_auc_score(y_train, lg3.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3294951995844779
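The cutoff chosen above maximizes tpr - fpr, a quantity known as Youden's J statistic. The sketch below reproduces that selection rule in plain Python on a tiny, made-up set of labels and scores so the mechanics are visible; the function names and data are assumptions for illustration, and the notebook itself relies on `roc_curve` for this.

```python
def tpr_fpr(y_true, y_score, threshold):
    """True and false positive rates when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def youden_threshold(y_true, y_score):
    """Observed score that maximizes TPR - FPR, with the J value it attains."""
    best_threshold, best_j = None, float("-inf")
    for candidate in sorted(set(y_score), reverse=True):
        tpr, fpr = tpr_fpr(y_true, y_score, candidate)
        if tpr - fpr > best_j:
            best_threshold, best_j = candidate, tpr - fpr
    return best_threshold, best_j


# Hypothetical labels (1 = canceled) and predicted probabilities
y_true = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
threshold, j_stat = youden_threshold(y_true, y_score)
print(threshold, round(j_stat, 2))
```

Lowering the cutoff below the selected value admits false positives faster than it recovers true positives, which is why J peaks where it does.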
# creating confusion matrix
confusion_matrix_statsmodels(
lg3, X_train3, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg3, X_train3, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.78 | 0.76 | 0.63 | 0.69 |
y_scores = lg3.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    # precision and recall have one more element than thresholds, so drop the last
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.42
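The value 0.42 appears to be read off the plot at the point where the precision and recall curves meet. A minimal sketch of locating such a crossover programmatically is shown below, using plain Python and made-up data; the helper names are hypothetical, and the notebook itself picks the value visually.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def balanced_threshold(y_true, y_score):
    """Observed score at which |precision - recall| is smallest."""
    def gap(threshold):
        precision, recall = precision_recall_at(y_true, y_score, threshold)
        return abs(precision - recall)
    return min(sorted(set(y_score)), key=gap)


# Hypothetical labels (1 = canceled) and predicted probabilities
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
crossover = balanced_threshold(toy_labels, toy_scores)
print(crossover, precision_recall_at(toy_labels, toy_scores, crossover))
```

A crossover threshold trades some recall for precision relative to the ROC-based cutoff, which matches the pattern in the training results reported for 0.42.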
# creating confusion matrix
confusion_matrix_statsmodels(lg3, X_train3, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg3, X_train3, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79 | 0.69 | 0.68 | 0.68 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.33 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df = models_train_comp_df.reset_index()
models_train_comp_df
Training performance comparison:
| | index | Logistic Regression-default Threshold | Logistic Regression-0.33 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|---|
| 0 | Accuracy | 0.80 | 0.78 | 0.79 |
| 1 | Recall | 0.61 | 0.76 | 0.69 |
| 2 | Precision | 0.73 | 0.63 | 0.68 |
| 3 | F1 | 0.67 | 0.69 | 0.68 |
Dropping the columns from the test set that were dropped from the training set
X_test3 = X_test[X_train3.columns].astype(float)
Using model with default threshold
# creating confusion matrix
confusion_matrix_statsmodels(lg3, X_test3, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg3, X_test3, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80 | 0.63 | 0.71 | 0.67 |
logit_roc_auc_test = roc_auc_score(y_test, lg3.predict(X_test3))
fpr, tpr, thresholds = roc_curve(y_test, lg3.predict(X_test3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold = 0.33
# creating confusion matrix
confusion_matrix_statsmodels(lg3, X_test3, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg3, X_test3, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.77 | 0.77 | 0.62 | 0.69 |
Using model with threshold = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(lg3, X_test3, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg3, X_test3, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79 | 0.70 | 0.67 | 0.68 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.33 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.33 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80 | 0.77 | 0.79 |
| Recall | 0.63 | 0.77 | 0.70 |
| Precision | 0.71 | 0.62 | 0.67 |
| F1 | 0.67 | 0.69 | 0.68 |
To reduce the likelihood of cancellations, build pricing and programs around:
Online booking is barrier-free, and most of the cancellations come from that segment:
===========================================
# This is a new copy of the data; the changes below are applied to it so that the two data sets stay separate.
data2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36238 entries, 0 to 36274
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36238 non-null  int64
 1   no_of_children                        36238 non-null  int64
 2   no_of_weekend_nights                  36238 non-null  int64
 3   no_of_week_nights                     36238 non-null  int64
 4   type_of_meal_plan                     36238 non-null  object
 5   required_car_parking_space            36238 non-null  int64
 6   room_type_reserved                    36238 non-null  object
 7   lead_time                             36238 non-null  float64
 8   market_segment_type                   36238 non-null  object
 9   repeated_guest                        36238 non-null  int64
 10  no_of_previous_cancellations          36238 non-null  int64
 11  no_of_previous_bookings_not_canceled  36238 non-null  int64
 12  avg_price_per_room                    36238 non-null  float64
 13  no_of_special_requests                36238 non-null  int64
 14  booking_status                        36238 non-null  int64
 15  week_of_year                          36238 non-null  float64
dtypes: float64(3), int64(10), object(3)
memory usage: 6.0+ MB
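# Follow these steps to remove spaces, special characters, and punctuation
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
cleaned_column_names = (
    data2.columns.str.strip()
    .str.replace("((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z]))", r"_\1", regex=True)
    .str.lower()
    .str.replace("[ _-]+", "_", regex=True)
    .str.replace("[}{)(><.!?\\\\:;,-]", "", regex=True)
)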
data2.columns = cleaned_column_names
data2["type_of_meal_plan"] = data2["type_of_meal_plan"].str.replace("Meal Plan",'')
data2["type_of_meal_plan"] = data2["type_of_meal_plan"].replace("Not Selected", 0)
data2['room_type_reserved'] = data2['room_type_reserved'].str.replace('Room_Type','')
data2["booking_status"] = data2["booking_status"].replace("Not_Canceled", 0)
data2["booking_status"] = data2["booking_status"].replace("Canceled", 1)
# Change the market segment to numbers: 0 (Online), 1 (Offline), 2 (Corporate), 3 (Complementary), and 4 (Aviation)
data2.market_segment_type.value_counts()
Online           23194
Offline          10518
Corporate         2011
Complementary      390
Aviation           125
Name: market_segment_type, dtype: int64
data2['market_segment_type'] = data2['market_segment_type'].replace("Online", 0)
data2['market_segment_type'] = data2['market_segment_type'].replace("Offline", 1)
data2['market_segment_type'] = data2['market_segment_type'].replace("Corporate", 2)
data2['market_segment_type'] = data2['market_segment_type'].replace("Complementary", 3)
data2['market_segment_type'] = data2['market_segment_type'].replace("Aviation", 4)
data2.market_segment_type.value_counts()
data2.describe(include=["object", "bool"])
| | type_of_meal_plan | room_type_reserved | market_segment_type |
|---|---|---|---|
| count | 36238 | 36238 | 36238 |
| unique | 4 | 7 | 5 |
| top | Meal Plan 1 | Room_Type 1 | Online |
| freq | 27802 | 28105 | 23194 |
data2.tail(10)
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | week_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36265 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 15.00 | Online | 0 | 0 | 0 | 100.73 | 0 | 0 | 22.00 |
| 36266 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 2 | 8.00 | Online | 0 | 0 | 0 | 85.96 | 1 | 1 | 14.00 |
| 36267 | 2 | 0 | 1 | 0 | Not Selected | 0 | Room_Type 1 | 49.00 | Online | 0 | 0 | 0 | 93.15 | 0 | 1 | 45.00 |
| 36268 | 1 | 0 | 0 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 166.00 | Offline | 0 | 0 | 0 | 110.00 | 0 | 1 | 2.00 |
| 36269 | 2 | 2 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 6 | 61.00 | Online | 0 | 0 | 0 | 216.00 | 0 | 1 | 23.00 |
| 36270 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85.00 | Online | 0 | 0 | 0 | 167.80 | 1 | 0 | 10.00 |
| 36271 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228.00 | Online | 0 | 0 | 0 | 90.95 | 2 | 1 | 42.00 |
| 36272 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148.00 | Online | 0 | 0 | 0 | 98.39 | 2 | 0 | 1.00 |
| 36273 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63.00 | Online | 0 | 0 | 0 | 94.50 | 0 | 1 | 16.00 |
| 36274 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207.00 | Offline | 0 | 0 | 0 | 161.67 | 0 | 0 | 52.00 |
data2.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36238.00 | 1.85 | 0.52 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 |
| no_of_children | 36238.00 | 0.11 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| no_of_weekend_nights | 36238.00 | 0.81 | 0.87 | 0.00 | 0.00 | 1.00 | 2.00 | 7.00 |
| no_of_week_nights | 36238.00 | 2.20 | 1.41 | 0.00 | 1.00 | 2.00 | 3.00 | 17.00 |
| required_car_parking_space | 36238.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| lead_time | 36238.00 | 87.45 | 84.52 | 1.00 | 22.00 | 61.00 | 126.00 | 443.00 |
| repeated_guest | 36238.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| no_of_previous_cancellations | 36238.00 | 0.02 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 |
| no_of_previous_bookings_not_canceled | 36238.00 | 0.15 | 1.75 | 0.00 | 0.00 | 0.00 | 0.00 | 58.00 |
| avg_price_per_room | 36238.00 | 104.94 | 32.68 | 0.50 | 81.00 | 100.00 | 120.00 | 540.00 |
| no_of_special_requests | 36238.00 | 0.62 | 0.79 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| booking_status | 36238.00 | 0.33 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| week_of_year | 36238.00 | 28.43 | 14.33 | 1.00 | 16.00 | 30.00 | 41.00 | 52.00 |
plt.figure(figsize=(15, 7))
sns.heatmap(data2.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral")
plt.show()
data2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36238 entries, 0 to 36274
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36238 non-null  int64
 1   no_of_children                        36238 non-null  int64
 2   no_of_weekend_nights                  36238 non-null  int64
 3   no_of_week_nights                     36238 non-null  int64
 4   type_of_meal_plan                     36238 non-null  object
 5   required_car_parking_space            36238 non-null  int64
 6   room_type_reserved                    36238 non-null  object
 7   lead_time                             36238 non-null  float64
 8   market_segment_type                   36238 non-null  object
 9   repeated_guest                        36238 non-null  int64
 10  no_of_previous_cancellations          36238 non-null  int64
 11  no_of_previous_bookings_not_canceled  36238 non-null  int64
 12  avg_price_per_room                    36238 non-null  float64
 13  no_of_special_requests                36238 non-null  int64
 14  booking_status                        36238 non-null  int64
 15  week_of_year                          36238 non-null  float64
dtypes: float64(3), int64(10), object(3)
memory usage: 6.0+ MB
dummy_data = pd.get_dummies(
    data2,
    columns=[
        "type_of_meal_plan",
        "room_type_reserved",
    ],
    drop_first=True,
)
dummy_data.head()
#Creating training and test sets.
X = data2.drop("booking_status", axis=1)
Y = data2["booking_status"]
# creating dummy variables: get_dummies encodes both object and category columns here,
# and drop_first=True drops the first level of each because it is implied by the others
X = pd.get_dummies(X, columns=X.select_dtypes(include=['object', 'category']).columns.tolist(), drop_first=True)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25366, 25)
Shape of test set :  (10872, 25)
Percentage of classes in training set:
0   0.67
1   0.33
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.68
1   0.32
Name: booking_status, dtype: float64
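The `drop_first=True` dummy encoding used when building X can be illustrated with a small, dependency-free sketch. It mimics the idea rather than the pandas implementation: one indicator column per category level, with the first level (in sorted order) dropped because it is implied when all the remaining indicators are 0. The function name and the sample values are assumptions for illustration.

```python
def one_hot_drop_first(values):
    """One-hot encode category labels, dropping the first level in sorted order."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped level is implied when every kept column is 0
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]


# Hypothetical sample of market segments
segments = ["Online", "Offline", "Aviation", "Online"]
encoded = one_hot_drop_first(segments)
for segment, row in zip(segments, encoded):
    print(segment, row)
```

This is consistent with the model matrix in this notebook, where the market segment columns are Complementary, Corporate, Offline, and Online, and Aviation serves as the implied baseline.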
## Function to calculate recall score
def get_recall_score(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    prediction = model.predict(predictors)
    return recall_score(target, prediction)
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell label shows the raw count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree becomes biased toward it.
In this case, we can pass a dictionary {0:0.15,1:0.85} to the model to specify the weight of each class, and the decision tree will give more weight to class 1.
class_weight is a hyperparameter for the decision tree classifier.
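To see why class weights change what the tree considers a "pure" node, the sketch below computes Gini impurity after scaling each class count by its weight. This is an illustrative simplification under the assumption that weights act as multipliers on class counts; the function name and the node counts are hypothetical.

```python
def weighted_gini(class_counts, class_weight):
    """Gini impurity of a node after scaling each class count by its weight."""
    weighted = {c: n * class_weight[c] for c, n in class_counts.items()}
    total = sum(weighted.values())
    return 1.0 - sum((w / total) ** 2 for w in weighted.values())


# Hypothetical node holding 90 samples of class 0 and 10 samples of class 1
node = {0: 90, 1: 10}
unweighted = weighted_gini(node, {0: 1.0, 1: 1.0})
reweighted = weighted_gini(node, {0: 0.15, 1: 0.85})
print(round(unweighted, 3), round(reweighted, 3))
```

Without weights the node looks nearly pure, so the tree has little incentive to split off the minority class. With the weights applied, the same node looks much more mixed, which pushes the tree to keep splitting until class 1 is isolated.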
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = get_recall_score(model, X_train, y_train)
print("Recall Score:", decision_tree_perf_train)
Recall Score: 0.9977250957854407
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = get_recall_score(model, X_test, y_test)
print("Recall Score:", decision_tree_perf_test)
Recall Score: 0.7980714690867838
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# draw the arrows between nodes in black so they stay visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
The plotted tree is far too deep to read at this size ;)
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 90.50 | |--- no_of_special_requests <= 1.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- repeated_guest <= 0.50 | | | | | | |--- market_segment_type_Complementary <= 0.50 | | | | | | | |--- avg_price_per_room <= 88.60 | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | |--- week_of_year <= 7.50 | | | | | | | | | | |--- avg_price_per_room <= 65.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- avg_price_per_room > 65.50 | | | | | | | | | | | |--- weights: [5.55, 0.00] class: 0 | | | | | | | | | |--- week_of_year > 7.50 | | | | | | | | | | |--- lead_time <= 16.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | | |--- lead_time > 16.50 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | |--- lead_time <= 1.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- lead_time > 1.50 | | | | | | | | | | |--- weights: [7.65, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 88.60 | | | | | | | | |--- week_of_year <= 24.50 | | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | | |--- weights: [2.55, 0.00] class: 0 | | | | | | | | |--- week_of_year > 24.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- week_of_year <= 43.50 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | | |--- week_of_year > 43.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- lead_time <= 
12.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 12.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- market_segment_type_Complementary > 0.50 | | | | | | | |--- weights: [16.05, 0.00] class: 0 | | | | | |--- repeated_guest > 0.50 | | | | | | |--- no_of_previous_cancellations <= 2.50 | | | | | | | |--- weights: [41.25, 0.00] class: 0 | | | | | | |--- no_of_previous_cancellations > 2.50 | | | | | | | |--- no_of_previous_bookings_not_canceled <= 12.50 | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | |--- lead_time <= 47.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 47.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | |--- no_of_previous_bookings_not_canceled > 12.50 | | | | | | | | |--- no_of_previous_cancellations <= 3.50 | | | | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | | | | | |--- no_of_previous_cancellations > 3.50 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- avg_price_per_room <= 199.01 | | | | | | |--- weights: [276.00, 0.00] class: 0 | | | | | |--- avg_price_per_room > 199.01 | | | | | | |--- lead_time <= 22.00 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- lead_time > 22.00 | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- no_of_special_requests <= 0.50 | | | | | |--- lead_time <= 65.50 | | | | | | |--- lead_time <= 1.50 | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- weights: [0.00, 24.65] class: 1 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 100.00 | | | | | | | | | | 
| |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | | |--- avg_price_per_room > 100.00 | | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | |--- weights: [1.95, 0.00] class: 0 | | | | | | |--- lead_time > 1.50 | | | | | | | |--- no_of_weekend_nights <= 3.50 | | | | | | | | |--- week_of_year <= 40.50 | | | | | | | | | |--- lead_time <= 60.50 | | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | | |--- truncated branch of depth 18 | | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- lead_time > 60.50 | | | | | | | | | | |--- avg_price_per_room <= 93.74 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 93.74 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- week_of_year > 40.50 | | | | | | | | | |--- avg_price_per_room <= 64.90 | | | | | | | | | | |--- weights: [8.85, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 64.90 | | | | | | | | | | |--- week_of_year <= 44.50 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- week_of_year > 44.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | |--- no_of_weekend_nights > 3.50 | | | | | | | | |--- week_of_year <= 10.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- week_of_year > 10.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [0.00, 11.90] class: 1 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- lead_time > 65.50 | | | | | | |--- avg_price_per_room <= 99.98 | | | | | | | |--- week_of_year <= 39.50 | | | | | | | | |--- week_of_year <= 9.00 | | | | | | | | | |--- weights: [3.30, 0.00] class: 0 | | | | | | | | |--- 
week_of_year > 9.00
| | | | | | | | | |--- avg_price_per_room <= 80.38
| | | | | | | | | | |--- avg_price_per_room <= 63.07
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- avg_price_per_room > 63.07
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | | | |--- avg_price_per_room > 80.38
| | | | | | | | | | |--- week_of_year <= 27.00
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | | |--- week_of_year > 27.00
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | |--- week_of_year > 39.50
| | | | | | | | |--- avg_price_per_room <= 92.27
| | | | | | | | | |--- weights: [6.30, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 92.27
| | | | | | | | | |--- week_of_year <= 44.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 44.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | |--- avg_price_per_room > 99.98
| | | | | | | |--- lead_time <= 85.00
| | | | | | | | |--- avg_price_per_room <= 128.00
| | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | |--- no_of_adults <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | | |--- no_of_adults > 2.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 128.00
| | | | | | | | | |--- lead_time <= 72.50
| | | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | | |--- lead_time > 72.50
| | | | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | |--- lead_time > 85.00
| | | | | | | | |--- lead_time <= 88.50
| | | | | | | | | |--- weights: [1.95, 0.00] class: 0
| | | | | | | | |--- lead_time > 88.50
| | | | | | | | | |--- week_of_year <= 29.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 29.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | |--- no_of_special_requests > 0.50
| | | | | |--- room_type_reserved_Room_Type 5 <= 0.50
| | | | | | |--- no_of_weekend_nights <= 4.00
| | | | | | | |--- avg_price_per_room <= 126.00
| | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50
| | | | | | | | | |--- weights: [69.75, 0.00] class: 0
| | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50
| | | | | | | | | |--- weights: [1.65, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 126.00
| | | | | | | | |--- week_of_year <= 20.50
| | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | |--- week_of_year > 20.50
| | | | | | | | | |--- weights: [2.55, 0.00] class: 0
| | | | | | |--- no_of_weekend_nights > 4.00
| | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | |--- room_type_reserved_Room_Type 5 > 0.50
| | | | | | |--- no_of_week_nights <= 2.50
| | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | |--- no_of_week_nights > 2.50
| | | | | | | |--- weights: [0.00, 2.55] class: 1
| | |--- market_segment_type_Online > 0.50
| | | |--- no_of_special_requests <= 0.50
| | | | |--- lead_time <= 8.50
| | | | | |--- avg_price_per_room <= 121.30
| | | | | | |--- no_of_weekend_nights <= 1.50
| | | | | | | |--- avg_price_per_room <= 76.29
| | | | | | | | |--- weights: [9.45, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 76.29
| | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | |--- week_of_year <= 40.50
| | | | | | | | | | |--- lead_time <= 5.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | | |--- lead_time > 5.50
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | |--- week_of_year > 40.50
| | | | | | | | | | |--- lead_time <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | | |--- lead_time > 2.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | |--- required_car_parking_space <= 0.50
| | | | | | | | | | |--- lead_time <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 9
| | | | | | | | | | |--- lead_time > 2.50
| | | | | | | | | | | |--- truncated branch of depth 15
| | | | | | | | | |--- required_car_parking_space > 0.50
| | | | | | | | | | |--- weights: [2.55, 0.00] class: 0
| | | | | | |--- no_of_weekend_nights > 1.50
| | | | | | | |--- week_of_year <= 6.00
| | | | | | | | |--- weights: [1.95, 0.00] class: 0
| | | | | | | |--- week_of_year > 6.00
| | | | | | | | |--- week_of_year <= 29.00
| | | | | | | | | |--- week_of_year <= 8.50
| | | | | | | | | | |--- avg_price_per_room <= 80.00
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | | | |--- avg_price_per_room > 80.00
| | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 8.50
| | | | | | | | | | |--- no_of_children <= 1.00
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | | |--- no_of_children > 1.00
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- week_of_year > 29.00
| | | | | | | | | |--- no_of_week_nights <= 6.00
| | | | | | | | | | |--- avg_price_per_room <= 80.17
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- avg_price_per_room > 80.17
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | |--- no_of_week_nights > 6.00
| | | | | | | | | | |--- week_of_year <= 49.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- week_of_year > 49.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | |--- avg_price_per_room > 121.30
| | | | | | |--- lead_time <= 2.50
| | | | | | | |--- avg_price_per_room <= 202.67
| | | | | | | | |--- no_of_week_nights <= 1.50
| | | | | | | | | |--- avg_price_per_room <= 130.50
| | | | | | | | | | |--- weights: [2.40, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 130.50
| | | | | | | | | | |--- avg_price_per_room <= 139.50
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | | |--- avg_price_per_room > 139.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | |--- no_of_week_nights > 1.50
| | | | | | | | | |--- weights: [4.05, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 202.67
| | | | | | | | |--- no_of_week_nights <= 1.50
| | | | | | | | | |--- weights: [0.00, 1.70] class: 1
| | | | | | | | |--- no_of_week_nights > 1.50
| | | | | | | | | |--- weights: [0.00, 3.40] class: 1
| | | | | | |--- lead_time > 2.50
| | | | | | | |--- week_of_year <= 49.50
| | | | | | | | |--- required_car_parking_space <= 0.50
| | | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | | |--- week_of_year <= 44.50
| | | | | | | | | | | |--- truncated branch of depth 15
| | | | | | | | | | |--- week_of_year > 44.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | |--- required_car_parking_space > 0.50
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50
| | | | | | | | | | |--- weights: [1.20, 0.00] class: 0
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- week_of_year > 49.50
| | | | | | | | |--- room_type_reserved_Room_Type 6 <= 0.50
| | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | | |--- room_type_reserved_Room_Type 6 > 0.50
| | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | |--- lead_time > 8.50
| | | | | |--- required_car_parking_space <= 0.50
| | | | | | |--- avg_price_per_room <= 105.28
| | | | | | | |--- lead_time <= 25.50
| | | | | | | | |--- week_of_year <= 49.50
| | | | | | | | | |--- week_of_year <= 6.50
| | | | | | | | | | |--- avg_price_per_room <= 78.09
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- avg_price_per_room > 78.09
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | |--- week_of_year > 6.50
| | | | | | | | | | |--- avg_price_per_room <= 78.90
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | | | | |--- avg_price_per_room > 78.90
| | | | | | | | | | | |--- truncated branch of depth 17
| | | | | | | | |--- week_of_year > 49.50
| | | | | | | | | |--- lead_time <= 24.50
| | | | | | | | | | |--- weights: [7.35, 0.00] class: 0
| | | | | | | | | |--- lead_time > 24.50
| | | | | | | | | | |--- avg_price_per_room <= 100.00
| | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 100.00
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- lead_time > 25.50
| | | | | | | | |--- avg_price_per_room <= 63.52
| | | | | | | | | |--- lead_time <= 84.50
| | | | | | | | | | |--- week_of_year <= 24.50
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- week_of_year > 24.50
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | |--- lead_time > 84.50
| | | | | | | | | | |--- avg_price_per_room <= 47.63
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 47.63
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | |--- avg_price_per_room > 63.52
| | | | | | | | | |--- lead_time <= 60.50
| | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 22
| | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50
| | | | | | | | | | | |--- truncated branch of depth 15
| | | | | | | | | |--- lead_time > 60.50
| | | | | | | | | | |--- lead_time <= 61.50
| | | | | | | | | | | |--- truncated branch of depth 12
| | | | | | | | | | |--- lead_time > 61.50
| | | | | | | | | | | |--- truncated branch of depth 19
| | | | | | |--- avg_price_per_room > 105.28
| | | | | | | |--- week_of_year <= 51.50
| | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50
| | | | | | | | | |--- avg_price_per_room <= 200.97
| | | | | | | | | | |--- avg_price_per_room <= 199.45
| | | | | | | | | | | |--- truncated branch of depth 28
| | | | | | | | | | |--- avg_price_per_room > 199.45
| | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 200.97
| | | | | | | | | | |--- week_of_year <= 4.00
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- week_of_year > 4.00
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50
| | | | | | | | | |--- week_of_year <= 24.50
| | | | | | | | | | |--- weights: [1.20, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 24.50
| | | | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1
| | | | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | |--- week_of_year > 51.50
| | | | | | | | |--- lead_time <= 24.50
| | | | | | | | | |--- weights: [1.80, 0.00] class: 0
| | | | | | | | |--- lead_time > 24.50
| | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | | |--- no_of_week_nights <= 2.00
| | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | | | | |--- no_of_week_nights > 2.00
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | | |--- avg_price_per_room <= 115.75
| | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1
| | | | | | | | | | |--- avg_price_per_room > 115.75
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | |--- required_car_parking_space > 0.50
| | | | | | |--- avg_price_per_room <= 209.83
| | | | | | | |--- weights: [9.30, 0.00] class: 0
| | | | | | |--- avg_price_per_room > 209.83
| | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | |--- no_of_special_requests > 0.50
| | | | |--- lead_time <= 4.50
| | | | | |--- no_of_weekend_nights <= 3.50
| | | | | | |--- week_of_year <= 34.50
| | | | | | | |--- lead_time <= 2.50
| | | | | | | | |--- week_of_year <= 21.50
| | | | | | | | | |--- week_of_year <= 18.50
| | | | | | | | | | |--- avg_price_per_room <= 82.75
| | | | | | | | | | | |--- truncated branch of depth 5
| | | | | | | | | | |--- avg_price_per_room > 82.75
| | | | | | | | | | | |--- weights: [11.85, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 18.50
| | | | | | | | | | |--- no_of_week_nights <= 1.50
| | | | | | | | | | | |--- truncated branch of depth 5
| | | | | | | | | | |--- no_of_week_nights > 1.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | |--- week_of_year > 21.50
| | | | | | | | | |--- weights: [11.55, 0.00] class: 0
| | | | | | | |--- lead_time > 2.50
| | | | | | | | |--- weights: [22.05, 0.00] class: 0
| | | | | | |--- week_of_year > 34.50
| | | | | | | |--- week_of_year <= 48.50
| | | | | | | | |--- lead_time <= 1.50
| | | | | | | | | |--- week_of_year <= 36.50
| | | | | | | | | | |--- avg_price_per_room <= 128.50
| | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 128.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- week_of_year > 36.50
| | | | | | | | | | |--- repeated_guest <= 0.50
| | | | | | | | | | | |--- weights: [6.45, 0.00] class: 0
| | | | | | | | | | |--- repeated_guest > 0.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- lead_time > 1.50
| | | | | | | | | |--- no_of_week_nights <= 2.50
| | | | | | | | | | |--- avg_price_per_room <= 113.25
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- avg_price_per_room > 113.25
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | |--- no_of_week_nights > 2.50
| | | | | | | | | | |--- week_of_year <= 41.00
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | | |--- week_of_year > 41.00
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | |--- week_of_year > 48.50
| | | | | | | | |--- weights: [7.35, 0.00] class: 0
| | | | | |--- no_of_weekend_nights > 3.50
| | | | | | |--- no_of_weekend_nights <= 5.00
| | | | | | | |--- weights: [0.00, 1.70] class: 1
| | | | | | |--- no_of_weekend_nights > 5.00
| | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | |--- lead_time > 4.50
| | | | | |--- required_car_parking_space <= 0.50
| | | | | | |--- week_of_year <= 50.50
| | | | | | | |--- avg_price_per_room <= 121.78
| | | | | | | | |--- lead_time <= 61.50
| | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50
| | | | | | | | | | |--- no_of_weekend_nights <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 19
| | | | | | | | | | |--- no_of_weekend_nights > 2.50
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50
| | | | | | | | | | |--- avg_price_per_room <= 67.33
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | | |--- avg_price_per_room > 67.33
| | | | | | | | | | | |--- truncated branch of depth 21
| | | | | | | | |--- lead_time > 61.50
| | | | | | | | | |--- week_of_year <= 42.50
| | | | | | | | | | |--- no_of_weekend_nights <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 22
| | | | | | | | | | |--- no_of_weekend_nights > 2.50
| | | | | | | | | | | |--- weights: [0.00, 4.25] class: 1
| | | | | | | | | |--- week_of_year > 42.50
| | | | | | | | | | |--- avg_price_per_room <= 61.38
| | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 61.38
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | |--- avg_price_per_room > 121.78
| | | | | | | | |--- room_type_reserved_Room_Type 7 <= 0.50
| | | | | | | | | |--- week_of_year <= 36.50
| | | | | | | | | | |--- lead_time <= 7.50
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | | | | |--- lead_time > 7.50
| | | | | | | | | | | |--- truncated branch of depth 25
| | | | | | | | | |--- week_of_year > 36.50
| | | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 20
| | | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | | |--- truncated branch of depth 15
| | | | | | | | |--- room_type_reserved_Room_Type 7 > 0.50
| | | | | | | | | |--- avg_price_per_room <= 243.33
| | | | | | | | | | |--- weights: [2.55, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 243.33
| | | | | | | | | | |--- no_of_adults <= 2.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- no_of_adults > 2.50
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | |--- week_of_year > 50.50
| | | | | | | |--- no_of_adults <= 0.50
| | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | |--- no_of_adults > 0.50
| | | | | | | | |--- weights: [22.35, 0.00] class: 0
| | | | | |--- required_car_parking_space > 0.50
| | | | | | |--- room_type_reserved_Room_Type 6 <= 0.50
| | | | | | | |--- weights: [22.35, 0.00] class: 0
| | | | | | |--- room_type_reserved_Room_Type 6 > 0.50
| | | | | | | |--- weights: [1.35, 0.00] class: 0
| |--- no_of_special_requests > 1.50
| | |--- no_of_week_nights <= 3.50
| | | |--- weights: [324.60, 0.00] class: 0
| | |--- no_of_week_nights > 3.50
| | | |--- no_of_special_requests <= 2.50
| | | | |--- no_of_weekend_nights <= 0.50
| | | | | |--- week_of_year <= 15.50
| | | | | | |--- avg_price_per_room <= 130.11
| | | | | | | |--- week_of_year <= 6.50
| | | | | | | | |--- weights: [1.50, 0.00] class: 0
| | | | | | | |--- week_of_year > 6.50
| | | | | | | | |--- avg_price_per_room <= 97.57
| | | | | | | | | |--- market_segment_type_Online <= 0.50
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | |--- market_segment_type_Online > 0.50
| | | | | | | | | | |--- weights: [0.00, 1.70] class: 1
| | | | | | | | |--- avg_price_per_room > 97.57
| | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | |--- avg_price_per_room > 130.11
| | | | | | | |--- room_type_reserved_Room_Type 6 <= 0.50
| | | | | | | | |--- lead_time <= 34.50
| | | | | | | | | |--- weights: [0.00, 2.55] class: 1
| | | | | | | | |--- lead_time > 34.50
| | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | |--- room_type_reserved_Room_Type 6 > 0.50
| | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | |--- week_of_year > 15.50
| | | | | | |--- avg_price_per_room <= 90.25
| | | | | | | |--- lead_time <= 45.00
| | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50
| | | | | | | | | |--- weights: [1.35, 0.00] class: 0
| | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50
| | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | |--- lead_time > 45.00
| | | | | | | | |--- avg_price_per_room <= 63.83
| | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 63.83
| | | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50
| | | | | | | | | | |--- lead_time <= 71.50
| | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1
| | | | | | | | | | |--- lead_time > 71.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | |--- avg_price_per_room > 90.25
| | | | | | | |--- lead_time <= 11.00
| | | | | | | | |--- lead_time <= 8.00
| | | | | | | | | |--- weights: [1.80, 0.00] class: 0
| | | | | | | | |--- lead_time > 8.00
| | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- lead_time > 11.00
| | | | | | | | |--- week_of_year <= 16.50
| | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- week_of_year > 16.50
| | | | | | | | | |--- weights: [7.20, 0.00] class: 0
| | | | |--- no_of_weekend_nights > 0.50
| | | | | |--- week_of_year <= 28.50
| | | | | | |--- avg_price_per_room <= 131.69
| | | | | | | |--- avg_price_per_room <= 123.75
| | | | | | | | |--- lead_time <= 31.50
| | | | | | | | | |--- avg_price_per_room <= 104.32
| | | | | | | | | | |--- week_of_year <= 6.50
| | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | | | | |--- week_of_year > 6.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- avg_price_per_room > 104.32
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | |--- lead_time > 31.50
| | | | | | | | | |--- no_of_weekend_nights <= 1.50
| | | | | | | | | | |--- weights: [1.50, 0.00] class: 0
| | | | | | | | | |--- no_of_weekend_nights > 1.50
| | | | | | | | | | |--- weights: [1.50, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 123.75
| | | | | | | | |--- avg_price_per_room <= 127.95
| | | | | | | | | |--- weights: [0.00, 1.70] class: 1
| | | | | | | | |--- avg_price_per_room > 127.95
| | | | | | | | | |--- no_of_week_nights <= 4.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- no_of_week_nights > 4.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | |--- avg_price_per_room > 131.69
| | | | | | | |--- weights: [2.70, 0.00] class: 0
| | | | | |--- week_of_year > 28.50
| | | | | | |--- week_of_year <= 50.50
| | | | | | | |--- lead_time <= 82.50
| | | | | | | | |--- lead_time <= 3.50
| | | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | | |--- lead_time > 3.50
| | | | | | | | | |--- lead_time <= 73.00
| | | | | | | | | | |--- avg_price_per_room <= 81.78
| | | | | | | | | | | |--- truncated branch of depth 5
| | | | | | | | | | |--- avg_price_per_room > 81.78
| | | | | | | | | | | |--- truncated branch of depth 14
| | | | | | | | | |--- lead_time > 73.00
| | | | | | | | | | |--- no_of_weekend_nights <= 1.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- no_of_weekend_nights > 1.50
| | | | | | | | | | | |--- weights: [0.00, 4.25] class: 1
| | | | | | | |--- lead_time > 82.50
| | | | | | | | |--- weights: [1.05, 0.00] class: 0
| | | | | | |--- week_of_year > 50.50
| | | | | | | |--- weights: [1.50, 0.00] class: 0
| | | |--- no_of_special_requests > 2.50
| | | | |--- room_type_reserved_Room_Type 7 <= 0.50
| | | | | |--- weights: [10.35, 0.00] class: 0
| | | | |--- room_type_reserved_Room_Type 7 > 0.50
| | | | | |--- weights: [0.15, 0.00] class: 0
|--- lead_time > 90.50
| |--- lead_time <= 151.50
| | |--- no_of_special_requests <= 0.50
| | | |--- avg_price_per_room <= 93.01
| | | | |--- week_of_year <= 51.50
| | | | | |--- lead_time <= 150.50
| | | | | | |--- market_segment_type_Corporate <= 0.50
| | | | | | | |--- no_of_week_nights <= 2.50
| | | | | | | | |--- avg_price_per_room <= 59.43
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50
| | | | | | | | | | |--- weights: [3.15, 0.00] class: 0
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50
| | | | | | | | | | |--- market_segment_type_Offline <= 0.50
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | | | |--- market_segment_type_Offline > 0.50
| | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 59.43
| | | | | | | | | |--- lead_time <= 121.50
| | | | | | | | | | |--- avg_price_per_room <= 75.07
| | | | | | | | | | | |--- truncated branch of depth 12
| | | | | | | | | | |--- avg_price_per_room > 75.07
| | | | | | | | | | | |--- truncated branch of depth 13
| | | | | | | | | |--- lead_time > 121.50
| | | | | | | | | | |--- market_segment_type_Online <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | | |--- market_segment_type_Online > 0.50
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | |--- no_of_week_nights > 2.50
| | | | | | | | |--- market_segment_type_Online <= 0.50
| | | | | | | | | |--- no_of_weekend_nights <= 0.50
| | | | | | | | | | |--- avg_price_per_room <= 71.12
| | | | | | | | | | | |--- weights: [5.10, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 71.12
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | |--- no_of_weekend_nights > 0.50
| | | | | | | | | | |--- avg_price_per_room <= 73.62
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | | | | |--- avg_price_per_room > 73.62
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | |--- market_segment_type_Online > 0.50
| | | | | | | | | |--- week_of_year <= 12.50
| | | | | | | | | | |--- lead_time <= 105.50
| | | | | | | | | | | |--- weights: [1.50, 0.00] class: 0
| | | | | | | | | | |--- lead_time > 105.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | |--- week_of_year > 12.50
| | | | | | | | | | |--- avg_price_per_room <= 74.67
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | | | |--- avg_price_per_room > 74.67
| | | | | | | | | | | |--- truncated branch of depth 17
| | | | | | |--- market_segment_type_Corporate > 0.50
| | | | | | | |--- avg_price_per_room <= 88.00
| | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | |--- weights: [5.55, 0.00] class: 0
| | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 88.00
| | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | |--- lead_time > 150.50
| | | | | | |--- weights: [7.80, 0.00] class: 0
| | | | |--- week_of_year > 51.50
| | | | | |--- avg_price_per_room <= 88.20
| | | | | | |--- weights: [9.60, 0.00] class: 0
| | | | | |--- avg_price_per_room > 88.20
| | | | | | |--- market_segment_type_Online <= 0.50
| | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | |--- market_segment_type_Online > 0.50
| | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | |--- avg_price_per_room > 93.01
| | | | |--- required_car_parking_space <= 0.50
| | | | | |--- market_segment_type_Offline <= 0.50
| | | | | | |--- week_of_year <= 31.50
| | | | | | | |--- lead_time <= 132.50
| | | | | | | | |--- repeated_guest <= 0.50
| | | | | | | | | |--- market_segment_type_Complementary <= 0.50
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 16
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | | | |--- truncated branch of depth 12
| | | | | | | | | |--- market_segment_type_Complementary > 0.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- repeated_guest > 0.50
| | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | |--- lead_time > 132.50
| | | | | | | | |--- market_segment_type_Corporate <= 0.50
| | | | | | | | | |--- avg_price_per_room <= 115.36
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | |--- avg_price_per_room > 115.36
| | | | | | | | | | |--- week_of_year <= 22.00
| | | | | | | | | | | |--- weights: [0.00, 16.15] class: 1
| | | | | | | | | | |--- week_of_year > 22.00
| | | | | | | | | | | |--- truncated branch of depth 9
| | | | | | | | |--- market_segment_type_Corporate > 0.50
| | | | | | | | | |--- week_of_year <= 29.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 29.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | |--- week_of_year > 31.50
| | | | | | | |--- lead_time <= 91.50
| | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | |--- lead_time > 91.50
| | | | | | | | |--- avg_price_per_room <= 93.72
| | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 93.72
| | | | | | | | | |--- avg_price_per_room <= 100.15
| | | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 9
| | | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 100.15
| | | | | | | | | | |--- week_of_year <= 43.50
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | | | | |--- week_of_year > 43.50
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | |--- market_segment_type_Offline > 0.50
| | | | | | |--- lead_time <= 111.50
| | | | | | | |--- week_of_year <= 11.50
| | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | |--- week_of_year > 11.50
| | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50
| | | | | | | | | |--- lead_time <= 91.50
| | | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | | |--- lead_time > 91.50
| | | | | | | | | | |--- week_of_year <= 31.50
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- week_of_year > 31.50
| | | | | | | | | | | |--- truncated branch of depth 8
| | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50
| | | | | | | | | |--- lead_time <= 101.00
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- lead_time > 101.00
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | |--- lead_time > 111.50
| | | | | | | |--- avg_price_per_room <= 96.73
| | | | | | | | |--- week_of_year <= 34.50
| | | | | | | | | |--- avg_price_per_room <= 94.25
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 94.25
| | | | | | | | | | |--- lead_time <= 142.00
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | | |--- lead_time > 142.00
| | | | | | | | | | | |--- weights: [0.00, 13.60] class: 1
| | | | | | | | |--- week_of_year > 34.50
| | | | | | | | | |--- weights: [1.95, 1.70] class: 0
| | | | | | | |--- avg_price_per_room > 96.73
| | | | | | | | |--- avg_price_per_room <= 113.00
| | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | |--- no_of_adults <= 2.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- no_of_adults > 2.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | |--- avg_price_per_room > 113.00
| | | | | | | | | |--- week_of_year <= 38.50
| | | | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | | | |--- week_of_year > 38.50
| | | | | | | | | | |--- no_of_week_nights <= 2.50
| | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | | |--- no_of_week_nights > 2.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | |--- required_car_parking_space > 0.50
| | | | | |--- weights: [3.15, 0.00] class: 0
| | |--- no_of_special_requests > 0.50
| | | |--- no_of_special_requests <= 2.50
| | | | |--- market_segment_type_Online <= 0.50
| | | | | |--- no_of_week_nights <= 2.50
| | | | | | |--- week_of_year <= 21.00
| | | | | | | |--- lead_time <= 143.50
| | | | | | | | |--- lead_time <= 98.50
| | | | | | | | | |--- lead_time <= 95.00
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | |--- lead_time > 95.00
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | |--- lead_time > 98.50
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50
| | | | | | | | | | |--- weights: [3.30, 0.00] class: 0
| | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50
| | | | | | | | | | |--- no_of_adults <= 1.50
| | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | | |--- no_of_adults > 1.50
| | | | | | | | | | | |--- weights: [0.75, 0.85] class: 1
| | | | | | | |--- lead_time > 143.50
| | | | | | | | |--- avg_price_per_room <= 88.83
| | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 88.83
| | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | |--- week_of_year > 21.00
| | | | | | | |--- avg_price_per_room <= 83.39
| | | | | | | | |--- no_of_weekend_nights <= 0.50
| | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | |--- no_of_weekend_nights > 0.50
| | | | | | | | | |--- avg_price_per_room <= 60.88
| | | | | | | | | | |--- weights: [0.30, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 60.88
| | | | | | | | | | |--- lead_time <= 97.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- lead_time > 97.50
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | |--- avg_price_per_room > 83.39
| | | | | | | | |--- lead_time <= 130.00
| | | | | | | | | |--- week_of_year <= 38.50
| | | | | | | | | | |--- no_of_week_nights <= 0.50
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- no_of_week_nights > 0.50
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | |--- week_of_year > 38.50
| | | | | | | | | | |--- avg_price_per_room <= 137.00
| | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 137.00
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | |--- lead_time > 130.00
| | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | |--- weights: [2.25, 0.00] class: 0
| | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | |--- no_of_week_nights > 2.50
| | | | | | |--- avg_price_per_room <= 188.57
| | | | | | | |--- week_of_year <= 7.50
| | | | | | | | |--- avg_price_per_room <= 122.00
| | | | | | | | | |--- weights: [1.65, 0.00] class: 0
| | | | | | | | |--- avg_price_per_room > 122.00
| | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- week_of_year > 7.50
| | | | | | | | |--- week_of_year <= 48.00
| | | | | | | | | |--- weights: [15.60, 0.00] class: 0
| | | | | | | | |--- week_of_year > 48.00
| | | | | | | | | |--- week_of_year <= 50.50
| | | | | | | | | | |--- no_of_adults <= 2.50
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | | | |--- no_of_adults > 2.50
| | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | | | |--- week_of_year > 50.50
| | | | | | | | | | |--- weights: [1.80, 0.00] class: 0
| | | | | | |--- avg_price_per_room > 188.57
| | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | |--- market_segment_type_Online > 0.50
| | | | | |--- required_car_parking_space <= 0.50
| | | | | | |--- lead_time <= 150.50
| | | | | | | |--- avg_price_per_room <= 200.86
| | | | | | | | |--- week_of_year <= 35.50
| | | | | | | | | |--- avg_price_per_room <= 76.54
| | | | | | | | | | |--- avg_price_per_room <= 67.75
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | | |--- avg_price_per_room > 67.75
| | | | | | | | | | | |--- truncated branch of depth 12
| | | | | | | | | |--- avg_price_per_room > 76.54
| | | | | | | | | | |--- no_of_special_requests <= 1.50
| | | | | | | | | | | |--- truncated branch of depth 29
| | | | | | | | | | |--- no_of_special_requests > 1.50
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | | |--- week_of_year > 35.50
| | | | | | | | | |--- no_of_week_nights <= 1.50
| | | | | | | | | | |--- no_of_children <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 12
| | | | | | | | | | |--- no_of_children > 0.50
| | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0
| | | | | | | | | |--- no_of_week_nights > 1.50
| | | | | | | | | | |--- avg_price_per_room <= 129.30
| | | | | | | | | | | |--- truncated branch of depth 15
| | | | | | | | | | |--- avg_price_per_room > 129.30
| | | | | | | | | | | |--- truncated branch of depth 10
| | | | | | | |--- avg_price_per_room > 200.86
| | | | | | | | |--- weights: [0.00, 10.20] class: 1
| | | | | | |--- lead_time > 150.50
| | | | | | | |--- avg_price_per_room <= 95.39
| | | | | | | | |--- weights: [1.20, 0.00] class: 0
| | | | | | | |--- avg_price_per_room > 95.39
| | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50
| | | | | | | | | |--- no_of_weekend_nights <= 0.50
| | | | | | | | | | |--- weights: [0.00, 5.95] class: 1
| | | | | | | | | |--- no_of_weekend_nights > 0.50
| | | | | | | | | | |--- weights: [0.00, 10.20] class: 1
| | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50
| | | | | | | | | |--- room_type_reserved_Room_Type 6 <= 0.50
| | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | |--- room_type_reserved_Room_Type 6 > 0.50
| | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | |--- required_car_parking_space > 0.50
| | | | | | |--- lead_time <= 150.00
| | | | | | | |--- no_of_week_nights <= 7.50
| | | | | | | | |--- weights: [10.05, 0.00] class: 0
| | | | | | | |--- no_of_week_nights > 7.50
| | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | |--- lead_time > 150.00
| | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | |--- no_of_special_requests > 2.50
| | | | |--- weights: [13.50, 0.00] class: 0
| |--- lead_time > 151.50
| | |--- avg_price_per_room <= 100.04
| | | |--- no_of_special_requests <= 0.50
| | | | |--- no_of_adults <= 1.50
| | | | | |--- market_segment_type_Online <= 0.50
| | | | | | |--- lead_time <= 163.50
| | | | | | | |--- no_of_week_nights <= 1.50
| | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | |--- no_of_week_nights > 1.50
| | | | | | | | |--- no_of_weekend_nights <= 1.00
| | | | | | | | | |--- weights: [0.15, 0.85] class: 1
| | | | | | | | |--- no_of_weekend_nights > 1.00
| | | | | | | | | |--- weights: [0.00, 14.45] class: 1
| | | | | | |--- lead_time > 163.50
| | | | | | | |--- lead_time <= 340.50
| | | | | | | | |--- lead_time <= 173.00
| | | | | | | | | |--- week_of_year <= 17.50
| | | | | | | | | | |--- avg_price_per_room <= 88.25
| | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | | | | | | |--- avg_price_per_room > 88.25
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- week_of_year > 17.50
| | | | | | | | | | |--- avg_price_per_room <= 86.60
| | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1
| | | | | | | | | | |--- avg_price_per_room > 86.60
| | | | | | | | | | | |--- weights: [0.00, 5.10] class: 1
| | | | | | | | |--- lead_time > 173.00
| | | | | | | | | |--- avg_price_per_room <= 98.00
| | | | | | | | | | |--- avg_price_per_room <= 55.21
| | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | | | |--- avg_price_per_room > 55.21
| | | | | | | | | | | |--- truncated branch of depth 9
| | | | | | | | | |--- avg_price_per_room > 98.00
| | | | | | | | | | |--- weights: [0.00, 2.55] class: 1
| | | | | | | |--- lead_time > 340.50
| | | | | | | | |--- week_of_year <= 27.00
| | | | | | | | | |--- avg_price_per_room <= 88.33
| | | | | | | | | | |--- weights: [0.00, 7.65] class: 1
| | | | | | | | | |--- avg_price_per_room > 88.33
| | | | | | | | | | |--- weights: [0.30, 0.85] class: 1
| | | | | | | | |--- week_of_year > 27.00
| | | | | | | | | |--- avg_price_per_room <= 78.00
| | | | | | | | | | |--- weights: [0.75, 0.00] class: 0
| | | | | | | | | |--- avg_price_per_room > 78.00
| | | | | | | | | | |--- no_of_weekend_nights <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- no_of_weekend_nights > 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | |--- market_segment_type_Online > 0.50
| | | | | | |--- week_of_year <= 51.50
| | | | | | | 
|--- avg_price_per_room <= 25.84 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 25.84 | | | | | | | | |--- avg_price_per_room <= 99.95 | | | | | | | | | |--- week_of_year <= 50.50 | | | | | | | | | | |--- weights: [0.00, 46.75] class: 1 | | | | | | | | | |--- week_of_year > 50.50 | | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- avg_price_per_room > 99.95 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- week_of_year > 51.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- no_of_adults > 1.50 | | | | | |--- avg_price_per_room <= 81.34 | | | | | | |--- lead_time <= 163.50 | | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | | | |--- lead_time <= 161.50 | | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 161.50 | | | | | | | | | | |--- week_of_year <= 21.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- week_of_year > 21.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | | |--- weights: [0.00, 5.95] class: 1 | | | | | | |--- lead_time > 163.50 | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | |--- week_of_year <= 50.50 | | | | | | | | | |--- week_of_year <= 
12.50 | | | | | | | | | | |--- avg_price_per_room <= 72.54 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 72.54 | | | | | | | | | | | |--- weights: [0.00, 8.50] class: 1 | | | | | | | | | |--- week_of_year > 12.50 | | | | | | | | | | |--- weights: [0.00, 111.35] class: 1 | | | | | | | | |--- week_of_year > 50.50 | | | | | | | | | |--- avg_price_per_room <= 80.72 | | | | | | | | | | |--- lead_time <= 217.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 217.00 | | | | | | | | | | | |--- weights: [0.00, 25.50] class: 1 | | | | | | | | | |--- avg_price_per_room > 80.72 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | |--- week_of_year <= 13.00 | | | | | | | | | |--- lead_time <= 189.50 | | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 189.50 | | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | | |--- weights: [4.20, 0.00] class: 0 | | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- week_of_year > 13.00 | | | | | | | | | |--- avg_price_per_room <= 80.49 | | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- avg_price_per_room > 80.49 | | | | | | | | | | |--- lead_time <= 171.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- lead_time > 171.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | |--- avg_price_per_room > 81.34 | | | | | | |--- 
no_of_adults <= 2.50 | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- lead_time <= 324.50 | | | | | | | | | | |--- week_of_year <= 51.00 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- week_of_year > 51.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 324.50 | | | | | | | | | | |--- avg_price_per_room <= 89.00 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 89.00 | | | | | | | | | | | |--- weights: [0.00, 5.95] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 10.20] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- avg_price_per_room <= 83.35 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 83.35 | | | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- no_of_adults > 2.50 | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- no_of_adults <= 2.50 | | | | | | | |--- no_of_week_nights <= 5.50 | | | | | | | | |--- lead_time <= 323.00 | | | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | | | |--- weights: [21.75, 0.00] class: 0 | | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- weights: [0.30, 0.85] class: 1 | | | | | | | | |--- lead_time > 323.00 | | | | | | | | | |--- 
no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | | | | |--- weights: [0.45, 0.85] class: 1 | | | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | |--- no_of_week_nights > 5.50 | | | | | | | | |--- lead_time <= 167.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- lead_time > 167.00 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- no_of_adults > 2.50 | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- lead_time <= 368.00 | | | | | | | |--- no_of_special_requests <= 2.00 | | | | | | | | |--- avg_price_per_room <= 58.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 58.50 | | | | | | | | | |--- weights: [0.60, 0.85] class: 1 | | | | | | | |--- no_of_special_requests > 2.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- lead_time > 368.00 | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | |--- weights: [0.30, 0.85] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | |--- lead_time <= 180.50 | | | | | | | |--- lead_time <= 160.00 | | | | | | | | |--- week_of_year <= 36.00 | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | |--- week_of_year > 36.00 | | | | | | | | | |--- no_of_week_nights <= 3.00 | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | | | |--- no_of_week_nights > 3.00 | | | | | | | | | | |--- 
no_of_special_requests <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- lead_time > 160.00 | | | | | | | | |--- week_of_year <= 1.50 | | | | | | | | | |--- lead_time <= 171.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- lead_time > 171.50 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- week_of_year > 1.50 | | | | | | | | | |--- weights: [6.15, 0.00] class: 0 | | | | | | |--- lead_time > 180.50 | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | |--- avg_price_per_room <= 99.97 | | | | | | | | | |--- week_of_year <= 51.50 | | | | | | | | | | |--- week_of_year <= 4.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- week_of_year > 4.00 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- week_of_year > 51.50 | | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- avg_price_per_room > 99.97 | | | | | | | | | |--- week_of_year <= 30.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- week_of_year > 30.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | |--- avg_price_per_room <= 68.15 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 68.15 | | | | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | |--- week_of_year <= 50.50 | | | | | | | |--- avg_price_per_room <= 67.46 | | | | | | | | |--- lead_time <= 217.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [1.05, 
0.00] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 217.50 | | | | | | | | | |--- lead_time <= 225.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 225.00 | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 67.46 | | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | | |--- week_of_year <= 40.50 | | | | | | | | | | |--- no_of_weekend_nights <= 2.50 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | | | |--- no_of_weekend_nights > 2.50 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- week_of_year > 40.50 | | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | | |--- lead_time <= 172.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- lead_time > 172.50 | | | | | | | | | | |--- weights: [1.95, 0.00] class: 0 | | | | | | |--- week_of_year > 50.50 | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | |--- avg_price_per_room <= 55.91 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 55.91 | | | | | | | | | | |--- lead_time <= 155.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 155.00 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | |--- avg_price_per_room > 100.04 | | | |--- no_of_special_requests <= 2.50 | | 
| | |--- week_of_year <= 50.50 | | | | | |--- week_of_year <= 49.50 | | | | | | |--- week_of_year <= 1.50 | | | | | | | |--- lead_time <= 258.50 | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | |--- weights: [0.00, 35.70] class: 1 | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- lead_time > 258.50 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- week_of_year > 1.50 | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | |--- week_of_year <= 4.00 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 99.45] class: 1 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- week_of_year > 4.00 | | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | | |--- weights: [0.00, 1021.70] class: 1 | | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | |--- avg_price_per_room <= 115.64 | | | | | | | | | | |--- lead_time <= 224.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- lead_time > 224.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- avg_price_per_room > 115.64 | | | | | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 6.80] class: 1 | | | | | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | | | | | |--- weights: [0.00, 142.80] class: 1 | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | |--- lead_time <= 272.50 | | | | | | | | | |--- weights: [0.00, 
11.05] class: 1 | | | | | | | | |--- lead_time > 272.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- week_of_year > 49.50 | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | |--- weights: [0.00, 30.60] class: 1 | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- week_of_year > 50.50 | | | | | |--- no_of_special_requests <= 0.50 | | | | | | |--- weights: [5.25, 0.00] class: 0 | | | | | |--- no_of_special_requests > 0.50 | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | |--- avg_price_per_room <= 145.15 | | | | | | | | |--- weights: [0.00, 7.65] class: 1 | | | | | | | |--- avg_price_per_room > 145.15 | | | | | | | | |--- avg_price_per_room <= 147.19 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 147.19 | | | | | | | | | |--- lead_time <= 229.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 229.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | |--- avg_price_per_room <= 118.15 | | | | | | | | |--- week_of_year <= 51.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- week_of_year > 51.50 | | | | | | | | | |--- avg_price_per_room <= 106.53 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 106.53 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- avg_price_per_room > 118.15 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | |--- no_of_special_requests > 2.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- weights: [4.35, 0.00] class: 0 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- weights: [0.15, 0.00] class: 0
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                                       Imp
lead_time                             0.30
avg_price_per_room                    0.17
no_of_special_requests                0.14
week_of_year                          0.12
market_segment_type_Online            0.09
no_of_week_nights                     0.05
no_of_weekend_nights                  0.04
no_of_adults                          0.02
required_car_parking_space            0.01
market_segment_type_Offline           0.01
room_type_reserved_Room_Type 4        0.01
type_of_meal_plan_Not Selected        0.01
no_of_children                        0.01
repeated_guest                        0.00
type_of_meal_plan_Meal Plan 2         0.00
room_type_reserved_Room_Type 5        0.00
market_segment_type_Corporate         0.00
market_segment_type_Complementary     0.00
room_type_reserved_Room_Type 2        0.00
room_type_reserved_Room_Type 6        0.00
no_of_previous_bookings_not_canceled  0.00
room_type_reserved_Room_Type 7        0.00
no_of_previous_cancellations          0.00
room_type_reserved_Room_Type 3        0.00
type_of_meal_plan_Meal Plan 3         0.00
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
"max_depth": [5, 10, 15, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=5,
min_impurity_decrease=0.01, random_state=1,
splitter='random')
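Optimising recall alone can reward a tree that flags every booking as a cancellation. The sketch below is one possible safeguard, not the notebook's method: it tracks recall and F1 together and refits on F1. It runs on scikit-learn's bundled breast cancer data as a stand-in for the hotel data, and the grid values and variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): any binary classification set works here.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# Small illustrative grid (hypothetical values).
param_grid = {"max_depth": [3, 5, None], "min_impurity_decrease": [0.0001, 0.01]}

# Track two metrics; the final model is refit using the best F1 combination.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring={"recall": "recall", "f1": "f1"},
    refit="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

mean_recall = search.cv_results_["mean_test_recall"]
print("Best parameters:", search.best_params_)
print("Nodes in best tree:", search.best_estimator_.tree_.node_count)
```

With `refit` enabled, `best_estimator_` is already fitted on the full training data, so a second call to `fit` is not required. Printing the node count is a quick way to spot a tree that was reduced to a single node.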
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = get_recall_score(estimator, X_train, y_train)
print("Recall Score:", decision_tree_tune_perf_train)
Recall Score: 1.0
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = get_recall_score(estimator, X_test, y_test)
print("Recall Score:", decision_tree_tune_perf_test)
Recall Score: 1.0
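A recall of exactly 1.0 on both the training and test sets deserves a sanity check, because a model that labels every booking as canceled also reaches it. The toy labels below are made up purely for illustration; they show how recall and precision diverge in that case.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 7 bookings kept (0) and 3 canceled (1).
y_true = np.array([0] * 7 + [1] * 3)

# A degenerate model that predicts "canceled" for every booking.
y_pred = np.ones_like(y_true)

recall = recall_score(y_true, y_pred)        # every actual cancellation is caught
precision = precision_score(y_true, y_pred)  # but most flagged bookings are wrong

print("Recall:", recall)
print("Precision:", precision)
```

Checking precision or the confusion matrix alongside recall makes this failure mode visible.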
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# draw the arrows between nodes in black so the splits are visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- weights: [2552.10, 7099.20] class: 1
Observations from the tree:
Note that the tuned tree printed above is a single root node (every booking is predicted as class 1), so it contains no rules of its own. The extracted rules below illustrate how a decision tree with splits is read:
|--- lead_time <= 90.50
|   |--- no_of_special_requests <= 1.50
|   |   |--- market_segment_type <= 0.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- weights: [91.80, 162.35] class: 1
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |   |--- weights: [212.85, 221.00] class: 1
You can keep reading the tree to find similar classes. Interpretations from other decision rules can be made similarly.
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Every importance is 0.00 here because the tuned tree is a single root node with no splits
                                       Imp
no_of_adults                          0.00
type_of_meal_plan_Meal Plan 3         0.00
market_segment_type_Offline           0.00
market_segment_type_Corporate         0.00
market_segment_type_Complementary     0.00
room_type_reserved_Room_Type 7        0.00
room_type_reserved_Room_Type 6        0.00
room_type_reserved_Room_Type 5        0.00
room_type_reserved_Room_Type 4        0.00
room_type_reserved_Room_Type 3        0.00
room_type_reserved_Room_Type 2        0.00
type_of_meal_plan_Not Selected        0.00
type_of_meal_plan_Meal Plan 2         0.00
no_of_children                        0.00
week_of_year                          0.00
no_of_special_requests                0.00
avg_price_per_room                    0.00
no_of_previous_bookings_not_canceled  0.00
no_of_previous_cancellations          0.00
repeated_guest                        0.00
lead_time                             0.00
required_car_parking_space            0.00
no_of_week_nights                     0.00
no_of_weekend_nights                  0.00
market_segment_type_Online            0.00
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
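An all-zero importance table is what a tree without any split produces. The sketch below reproduces this on synthetic data (an assumption, not the hotel data) by setting `min_impurity_decrease` to a deliberately unreachable value of 1.0, then compares the result with an unconstrained tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Gini impurity never exceeds 0.5 for two classes, so no split can reduce
# impurity by 1.0: the tree stays a single root node.
stump = DecisionTreeClassifier(min_impurity_decrease=1.0, random_state=0).fit(X, y)

# An unconstrained tree for comparison.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

print("Nodes in constrained tree:", stump.tree_.node_count)
print("Constrained importances:", stump.feature_importances_)
print("Unconstrained importances sum:", full.feature_importances_.sum())
```

When a tuned model reports zero importance for every feature, checking `tree_.node_count` confirms whether the tree has collapsed to its root.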
The DecisionTreeClassifier provides parameters such as
min_samples_leaf and max_depth to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned. Here we only show the effect of
ccp_alpha on regularizing the trees and how to choose a ccp_alpha
based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
## Note: the absolute value is taken because ccp_alpha must be non-negative. A few values in the pruning path come out slightly negative (most likely floating-point round-off), and taking abs() resolves this and allows the analysis to continue.
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00 | 0.01 |
| 1 | 0.00 | 0.01 |
| 2 | 0.00 | 0.01 |
| 3 | 0.00 | 0.01 |
| 4 | 0.00 | 0.01 |
| ... | ... | ... |
| 1930 | 0.00 | 0.27 |
| 1931 | 0.01 | 0.28 |
| 1932 | 0.01 | 0.29 |
| 1933 | 0.02 | 0.34 |
| 1934 | 0.05 | 0.39 |
1935 rows × 2 columns
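The pruning path can be explored on a small bundled dataset before running it on the full training data. The sketch below uses scikit-learn's breast cancer data as a stand-in (an assumption). It also clips the alphas at zero with `np.clip`, an alternative to `abs()` for slightly negative values: clipping maps them to 0 instead of flipping their sign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_tr, y_tr)

# ccp_alpha must be non-negative, so clip any round-off negatives to zero.
alphas = np.clip(path.ccp_alphas, 0.0, None)

print("Number of pruning steps:", len(alphas))
print("Smallest and largest alpha:", alphas.min(), alphas.max())
print("Leaf impurity at first and last step:", path.impurities[0], path.impurities[-1])
```

Each alpha in the path corresponds to one pruning step, and the total leaf impurity grows as more of the tree is removed.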
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04652385380997953
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decrease as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
The maximum recall is reached at an alpha of about 0.025, but at that value the decision tree would have only a root node and we would lose the business rules. Instead, we can choose an alpha of about 0.002 to 0.003, which retains the decision rules while still giving high recall.
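One way to make this choice programmatically is to consider only trees that keep a minimum number of leaves when searching for the best recall. The sketch below is an illustration under stated assumptions: it uses the bundled breast cancer data as a stand-in, and the `MIN_LEAVES` threshold and variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with a held-out validation split (assumption).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

MIN_LEAVES = 3  # hypothetical threshold: reject trees too small to give rules

candidates = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    if clf.get_n_leaves() >= MIN_LEAVES:
        recall = recall_score(y_val, clf.predict(X_val))
        candidates.append((recall, alpha, clf))

# Highest recall wins; ties go to the larger alpha, i.e. the smaller tree.
best_recall, best_alpha, best_clf = max(candidates, key=lambda item: (item[0], item[1]))

print("Chosen alpha:", best_alpha)
print("Validation recall:", best_recall)
print("Leaves in chosen tree:", best_clf.get_n_leaves())
```

The threshold encodes the requirement that the model must still yield interpretable rules. Its value is a judgment call.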
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.024674651341818815,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.024674651341818815,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(best_model, X_train, y_train)
print("Recall Score:", get_recall_score(best_model, X_train, y_train))
Recall Score: 1.0
confusion_matrix_sklearn(best_model, X_test, y_test)
print("Recall Score:", get_recall_score(best_model, X_test, y_test))
Recall Score: 1.0
plt.figure(figsize=(5, 5))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# draw the arrows between nodes in black so the splits are visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
Creating a model with ccp_alpha = 0.002 (below 0.005)
best_model2 = DecisionTreeClassifier(
ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85},
random_state=1)
confusion_matrix_sklearn(best_model2, X_train, y_train)
decision_tree_postpruned_perf_train = get_recall_score(best_model2, X_train, y_train)
print("Recall Score:", decision_tree_postpruned_perf_train)
Recall Score: 0.9614463601532567
confusion_matrix_sklearn(best_model2, X_test, y_test)
decision_tree_postpruned_perf_test = get_recall_score(best_model2, X_test, y_test)
print("Recall Score:", decision_tree_postpruned_perf_test)
Recall Score: 0.9653998865570051
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    best_model2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# draw the arrows between nodes in black so the splits are visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))
|--- lead_time <= 90.50
|   |--- no_of_special_requests <= 1.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- weights: [146.55, 82.45] class: 0
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 199.01
|   |   |   |   |   |   |--- weights: [276.00, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room > 199.01
|   |   |   |   |   |   |--- weights: [0.15, 12.75] class: 1
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |--- weights: [183.75, 126.65] class: 0
|   |   |   |   |   |--- lead_time > 65.50
|   |   |   |   |   |   |--- weights: [29.10, 94.35] class: 1
|   |   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |   |--- weights: [74.40, 4.25] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- weights: [91.80, 162.35] class: 1
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- weights: [229.95, 1528.30] class: 1
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |--- weights: [77.85, 16.15] class: 0
|   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |--- weights: [488.85, 625.60] class: 1
|   |--- no_of_special_requests > 1.50
|   |   |--- no_of_week_nights <= 3.50
|   |   |   |--- weights: [324.60, 0.00] class: 0
|   |   |--- no_of_week_nights > 3.50
|   |   |   |--- weights: [44.10, 37.40] class: 0
|--- lead_time > 90.50
|   |--- lead_time <= 151.50
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- weights: [168.75, 996.20] class: 1
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [214.35, 319.60] class: 1
|   |--- lead_time > 151.50
|   |   |--- avg_price_per_room <= 100.04
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |--- weights: [50.70, 103.70] class: 1
|   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |--- weights: [46.95, 956.25] class: 1
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- weights: [27.90, 6.80] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- weights: [62.85, 197.20] class: 1
|   |   |--- avg_price_per_room > 100.04
|   |   |   |--- weights: [13.50, 1829.20] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        best_model2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                       Imp
lead_time                             0.39
no_of_special_requests                0.31
market_segment_type_Online            0.19
avg_price_per_room                    0.03
no_of_weekend_nights                  0.02
no_of_week_nights                     0.02
market_segment_type_Offline           0.02
no_of_adults                          0.01
no_of_previous_bookings_not_canceled  0.00
room_type_reserved_Room_Type 4        0.00
required_car_parking_space            0.00
market_segment_type_Corporate         0.00
market_segment_type_Complementary     0.00
room_type_reserved_Room_Type 7        0.00
room_type_reserved_Room_Type 6        0.00
room_type_reserved_Room_Type 5        0.00
room_type_reserved_Room_Type 3        0.00
no_of_previous_cancellations          0.00
room_type_reserved_Room_Type 2        0.00
type_of_meal_plan_Not Selected        0.00
type_of_meal_plan_Meal Plan 3         0.00
no_of_children                        0.00
week_of_year                          0.00
repeated_guest                        0.00
type_of_meal_plan_Meal Plan 2         0.00
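To make the "normalized total reduction of the criterion" concrete, here is a minimal sketch of how the Gini reduction for a single split can be computed by hand, and how per-feature totals are normalized to sum to 1. The helper names and the node counts are hypothetical and for illustration only; they are not taken from best_model2.

```python
def gini(class_weights):
    """Gini impurity of a node given the (weighted) count of each class."""
    total = sum(class_weights)
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in class_weights)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the share of samples reaching it."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_parent / n_total) * (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )


def normalized_importances(decrease_by_feature):
    """Scale the summed decreases per feature so that they add up to 1."""
    total = sum(decrease_by_feature.values())
    return {name: value / total for name, value in decrease_by_feature.items()}


# Hypothetical root split: 100 samples, 50 per class, split into two purer children
root_decrease = weighted_impurity_decrease([50, 50], [40, 10], [10, 40], n_total=100)
print(round(root_decrease, 4))

# Hypothetical per-feature totals accumulated over all splits in a tree
print(normalized_importances({"feature_a": 0.18, "feature_b": 0.06}))
```

A feature that is never used in a split accumulates a total decrease of zero, which is why many of the one-hot encoded columns in the output above show an importance of 0.00.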
importances = best_model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.DataFrame(
    [
        decision_tree_perf_train,
        decision_tree_tune_perf_train,
        decision_tree_postpruned_perf_train,
    ],
    columns=["Recall on training set"],
)
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Recall on training set |
|---|---|
| 0 | 1.00 |
| 1 | 1.00 |
| 2 | 0.96 |
# testing performance comparison
models_test_comp_df = pd.DataFrame(
    [
        decision_tree_perf_test,
        decision_tree_tune_perf_test,
        decision_tree_postpruned_perf_test,
    ],
    columns=["Recall on testing set"],
)
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Recall on testing set |
|---|---|
| 0 | 0.80 |
| 1 | 1.00 |
| 2 | 0.97 |
While the decision tree allows us to design more granular A/B tests than the logistic model, the basic principles are the same:
To reduce the likelihood of cancellations, build pricing and programs around:
Online booking is barrier-free, and most of the cancellations come from that segment:
EXAMPLES - Each of these buckets of guests who are likely to cancel could be targeted through A/B tests of various fee levels (incentives).
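As a rough sketch of how such buckets could be wired into an A/B test, the snippet below maps a booking to one of two high-cancellation leaves read off the printed tree rules (long lead time with a room price above 100.04, and online bookings with no special requests and a lead time between 8.5 and 90.5 days), then assigns it to a test arm. The function names, the bucket labels, the arm names, and the use of a booking ID string are assumptions made for illustration.

```python
import hashlib

ARMS = ("control", "fee_variant")  # hypothetical arm names


def cancellation_bucket(booking):
    """Return a label for two high-cancellation leaves of the tree, else None."""
    lead = booking["lead_time"]
    # Leaf: lead_time > 151.50 and avg_price_per_room > 100.04
    if lead > 151.5 and booking["avg_price_per_room"] > 100.04:
        return "long_lead_high_price"
    # Leaf: 8.50 < lead_time <= 90.50, Online segment, no special requests
    if (
        8.5 < lead <= 90.5
        and booking["market_segment_type_Online"] > 0.5
        and booking["no_of_special_requests"] <= 0.5
    ):
        return "online_no_requests"
    return None


def ab_arm(booking_id):
    """Assign a booking to an arm deterministically by hashing its ID."""
    digest = hashlib.sha256(booking_id.encode("utf-8")).hexdigest()
    return ARMS[int(digest, 16) % len(ARMS)]


# Hypothetical booking used only to show the call pattern
example = {
    "lead_time": 200,
    "avg_price_per_room": 120.0,
    "market_segment_type_Online": 1,
    "no_of_special_requests": 0,
}
print(cancellation_bucket(example), ab_arm("BOOKING_0001"))
```

Hashing the ID rather than drawing a random number keeps the assignment stable if the same booking is scored more than once.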
While both models produce results, comparing the logistic regression (statsmodels and sklearn) with the decision tree leads to the following observations:
Logistic
Decision Tree
What also needs to be considered is the production environment. If it is stable, meaning the data arrives with known defects and does not change in character, the logistic model could be used as a second check, where the delta between the two models becomes a key metric or a warning signal that something has changed.
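One simple way to track that delta is the share of bookings on which the two models disagree, compared against a baseline measured at deployment time. The sketch below assumes both models emit 0/1 predictions for the same bookings; the function names and the 0.05 tolerance are hypothetical choices, not values derived from this analysis.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of rows where two models give different class predictions."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if len(preds_a) == 0:
        raise ValueError("prediction lists must not be empty")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def drift_alert(baseline_rate, current_rate, tolerance=0.05):
    """Flag when the disagreement rate moves away from its baseline."""
    return abs(current_rate - baseline_rate) > tolerance


# Hypothetical predictions from the logistic model and the decision tree
logit_preds = [1, 0, 1, 1, 0, 0, 1, 0]
tree_preds = [1, 0, 1, 0, 0, 0, 1, 1]

current = disagreement_rate(logit_preds, tree_preds)
print(current, drift_alert(baseline_rate=0.10, current_rate=current))
```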
As for the decision tree, the environment would need to be sized accordingly so that computational performance isn't an issue. If for some reason it is not feasible to run the decision tree in real time, use the logistic model to screen the "easy" decisions and the decision tree to run a second pass against the bookings that are not clear-cut.
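The two-pass idea can be sketched as a small cascade: the logistic probability settles the confident cases, and only the uncertain middle band is handed to the second model. The band limits of 0.2 and 0.8, the function name, and the stand-in second-stage rule are all assumptions for illustration; in practice the band would be tuned on validation data.

```python
def cascade_predict(logit_probs, rows, second_stage, low=0.2, high=0.8):
    """Use logistic probabilities for clear cases, defer the rest to second_stage.

    logit_probs:  predicted cancellation probability per booking
    rows:         the corresponding feature records
    second_stage: callable returning 0 or 1 for a single record
    """
    predictions = []
    deferred = 0
    for prob, row in zip(logit_probs, rows):
        if prob <= low:
            predictions.append(0)  # confidently not canceled
        elif prob >= high:
            predictions.append(1)  # confidently canceled
        else:
            predictions.append(second_stage(row))  # uncertain: second pass
            deferred += 1
    return predictions, deferred


def stand_in_tree(row):
    """Hypothetical stand-in for the decision tree's prediction on one record."""
    return 1 if row["lead_time"] > 90.5 else 0


probs = [0.05, 0.50, 0.95, 0.40]
records = [
    {"lead_time": 3},
    {"lead_time": 120},
    {"lead_time": 200},
    {"lead_time": 10},
]
print(cascade_predict(probs, records, stand_in_tree))
```

The count of deferred records gives a direct estimate of how much second-pass capacity the environment needs.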
# Logistic Regression testing performance comparison
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.33 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.33 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80 | 0.77 | 0.79 |
| Recall | 0.63 | 0.77 | 0.70 |
| Precision | 0.71 | 0.62 | 0.67 |
| F1 | 0.67 | 0.69 | 0.68 |
# Decision Tree (training performance comparison)
models_train_comp_df = pd.DataFrame(
    [
        decision_tree_perf_train,
        decision_tree_tune_perf_train,
        decision_tree_postpruned_perf_train,
    ],
    columns=["Recall on training set"],
)
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Recall on training set |
|---|---|
| 0 | 1.00 |
| 1 | 1.00 |
| 2 | 0.96 |
# Decision Tree (testing performance comparison)
models_test_comp_df = pd.DataFrame(
    [
        decision_tree_perf_test,
        decision_tree_tune_perf_test,
        decision_tree_postpruned_perf_test,
    ],
    columns=["Recall on testing set"],
)
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Recall on testing set |
|---|---|
| 0 | 0.80 |
| 1 | 1.00 |
| 2 | 0.97 |